Introduction to 
Applied Linear Algebra 


Vectors, Matrices, and Least Squares 


Stephen Boyd 


Department of Electrical Engineering 
Stanford University 


Lieven Vandenberghe 


Department of Electrical and Computer Engineering 
University of California, Los Angeles 


CAMBRIDGE 

UNIVERSITY PRESS 

University Printing House, Cambridge CB2 8BS, United Kingdom 
One Liberty Plaza, 20th Floor, New York, NY 10006, USA 

477 Williamstown Road, Port Melbourne, VIC 3207, Australia 


314-321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, 
New Delhi — 110025, India 


79 Anson Road, #406-04/06, Singapore 079906 


Cambridge University Press is part of the University of Cambridge. 


It furthers the University’s mission by disseminating knowledge in the pursuit of 
education, learning, and research at the highest international levels of excellence. 


www.cambridge.org 
Information on this title: www.cambridge.org/9781316518960 
DOI: 10.1017/9781108583664 


© Cambridge University Press 2018 


This publication is in copyright. Subject to statutory exception 
and to the provisions of relevant collective licensing agreements, 
no reproduction of any part may take place without the written 
permission of Cambridge University Press. 


First published 2018 

Printed in the United Kingdom by Clays, St Ives plc, 2018 

A catalogue record for this publication is available from the British Library. 
ISBN 978-1-316-51896-0 Hardback 

Additional resources for this publication at www.cambridge.org/IntroAppLinAlg 


Cambridge University Press has no responsibility for the persistence or accuracy 
of URLs for external or third-party internet websites referred to in this publication 
and does not guarantee that any content on such websites is, or will remain, 
accurate or appropriate. 


For 


Anna, Nicholas, and Nora 


Daniël and Margriet 


Contents


Preface


I Vectors

1 Vectors
    1.1 Vectors
    1.2 Vector addition
    1.3 Scalar-vector multiplication
    1.4 Inner product
    1.5 Complexity of vector computations
    Exercises

2 Linear functions
    2.1 Linear functions
    2.2 Taylor approximation
    2.3 Regression model
    Exercises

3 Norm and distance
    3.1 Norm
    3.2 Distance
    3.3 Standard deviation
    3.4 Angle
    3.5 Complexity
    Exercises

4 Clustering
    4.1 Clustering
    4.2 A clustering objective
    4.3 The k-means algorithm
    4.4 Examples
    4.5 Applications
    Exercises

5 Linear independence
    5.1 Linear dependence
    5.2 Basis
    5.3 Orthonormal vectors
    5.4 Gram-Schmidt algorithm
    Exercises


II Matrices

6 Matrices
    6.1 Matrices
    6.2 Zero and identity matrices
    6.3 Transpose, addition, and norm
    6.4 Matrix-vector multiplication
    6.5 Complexity
    Exercises

7 Matrix examples
    7.1 Geometric transformations
    7.2 Selectors
    7.3 Incidence matrix
    7.4 Convolution
    Exercises

8 Linear equations
    8.1 Linear and affine functions
    8.2 Linear function models
    8.3 Systems of linear equations
    Exercises

9 Linear dynamical systems
    9.1 Linear dynamical systems
    9.2 Population dynamics
    9.3 Epidemic dynamics
    9.4 Motion of a mass
    9.5 Supply chain dynamics
    Exercises

10 Matrix multiplication
    10.1 Matrix-matrix multiplication
    10.2 Composition of linear functions
    10.3 Matrix power
    10.4 QR factorization
    Exercises

11 Matrix inverses
    11.1 Left and right inverses
    11.2 Inverse
    11.3 Solving linear equations
    11.4 Examples
    11.5 Pseudo-inverse
    Exercises


III Least squares

12 Least squares
    12.1 Least squares problem
    12.2 Solution
    12.3 Solving least squares problems
    12.4 Examples
    Exercises

13 Least squares data fitting
    13.1 Least squares data fitting
    13.2 Validation
    13.3 Feature engineering
    Exercises

14 Least squares classification
    14.1 Classification
    14.2 Least squares classifier
    14.3 Multi-class classifiers
    Exercises

15 Multi-objective least squares
    15.1 Multi-objective least squares
    15.2 Control
    15.3 Estimation and inversion
    15.4 Regularized data fitting
    15.5 Complexity
    Exercises

16 Constrained least squares
    16.1 Constrained least squares problem
    16.2 Solution
    16.3 Solving constrained least squares problems
    Exercises

17 Constrained least squares applications
    17.1 Portfolio optimization
    17.2 Linear quadratic control
    17.3 Linear quadratic state estimation
    Exercises

18 Nonlinear least squares
    18.1 Nonlinear equations and least squares
    18.2 Gauss-Newton algorithm
    18.3 Levenberg-Marquardt algorithm
    18.4 Nonlinear model fitting
    18.5 Nonlinear least squares classification
    Exercises

19 Constrained nonlinear least squares
    19.1 Constrained nonlinear least squares
    19.2 Penalty algorithm
    19.3 Augmented Lagrangian algorithm
    19.4 Nonlinear control
    Exercises


Appendices

A Notation

B Complexity

C Derivatives and optimization
    C.1 Derivatives
    C.2 Optimization
    C.3 Lagrange multipliers

D Further study


Index


Preface 


This book is meant to provide an introduction to vectors, matrices, and least 
squares methods, basic topics in applied linear algebra. Our goal is to give the 
beginning student, with little or no prior exposure to linear algebra, a good ground- 
ing in the basic ideas, as well as an appreciation for how they are used in many 
applications, including data fitting, machine learning and artificial intelligence, to- 
mography, navigation, image processing, finance, and automatic control systems. 


The background required of the reader is familiarity with basic mathematical 
notation. We use calculus in just a few places, but it does not play a critical 
role and is not a strict prerequisite. Even though the book covers many topics 
that are traditionally taught as part of probability and statistics, such as fitting 
mathematical models to data, no knowledge of or background in probability and 
statistics is needed. 


The book covers less mathematics than a typical text on applied linear algebra. 
We use only one theoretical concept from linear algebra, linear independence, and 
only one computational tool, the QR factorization; our approach to most applica- 
tions relies on only one method, least squares (or some extension). In this sense 
we aim for intellectual economy: With just a few basic mathematical ideas, con- 
cepts, and methods, we cover many applications. The mathematics we do present, 
however, is complete, in that we carefully justify every mathematical statement. 
In contrast to most introductory linear algebra texts, however, we describe many 
applications, including some that are typically considered advanced topics, like 
document classification, control, state estimation, and portfolio optimization. 


The book does not require any knowledge of computer programming, and can be 
used as a conventional textbook, by reading the chapters and working the exercises 
that do not involve numerical computation. This approach, however, misses out on
one of the most compelling reasons to learn the material: You can use the ideas and
methods described in this book to do practical things like build a prediction model
from data, enhance images, or optimize an investment portfolio. The growing power
of computers, together with the development of high-level computer languages
and packages that support vector and matrix computation, has made it easy to
use the methods described in this book for real applications. For this reason we 
hope that every student of this book will complement their study with computer 
programming exercises and projects, including some that involve real data. This 
book includes some generic exercises that require computation; additional ones, 
and the associated data files and language-specific resources, are available online. 




If you read the whole book, work some of the exercises, and carry out computer 
exercises to implement or use the ideas and methods, you will learn a lot. While 
there will still be much for you to learn, you will have seen many of the basic ideas 
behind modern data science and other application areas. We hope you will be 
empowered to use the methods for your own applications. 

The book is divided into three parts. Part I introduces the reader to vectors, 
and various vector operations and functions like addition, inner product, distance, 
and angle. We also describe how vectors are used in applications to represent word 
counts in a document, time series, attributes of a patient, sales of a product, an 
audio track, an image, or a portfolio of investments. Part II does the same for 
matrices, culminating with matrix inverses and methods for solving linear equa- 
tions. Part III, on least squares, is the payoff, at least in terms of the applications. 
We show how the simple and natural idea of approximately solving a set of over- 
determined equations, and a few extensions of this basic idea, can be used to solve 
many practical problems. 

The whole book can be covered in a 15 week (semester) course; a 10 week 
(quarter) course can cover most of the material, by skipping a few applications and 
perhaps the last two chapters on nonlinear least squares. The book can also be used 
for self-study, complemented with material available online. By design, the pace of 
the book accelerates a bit, with many details and simple examples in parts I and II, 
and more advanced examples and applications in part III. A course for students 
with little or no background in linear algebra can focus on parts I and II, and cover 
just a few of the more advanced applications in part III. A more advanced course
on applied linear algebra can quickly cover parts I and II as review, and then focus 
on the applications in part III, as well as additional topics. 

We are grateful to many of our colleagues, teaching assistants, and students 
for helpful suggestions and discussions during the development of this book and 
the associated courses. We especially thank our colleagues Trevor Hastie, Rob 
Tibshirani, and Sanjay Lall, as well as Nick Boyd, for discussions about data fitting 
and classification, and Jenny Hong, Ahmed Bou-Rabee, Keegan Go, David Zeng, 
and Jaehyun Park, Stanford undergraduates who helped create and teach the course 
EE103. We thank David Tse, Alex Lemon, Neal Parikh, and Julie Lancashire for 
carefully reading drafts of this book and making many good suggestions. 


Stephen Boyd Stanford, California 
Lieven Vandenberghe Los Angeles, California 

Chapter 1


Vectors


In this chapter we introduce vectors and some common operations on them. We
describe some settings in which vectors are used.


1.1 Vectors


A vector is an ordered finite list of numbers. Vectors are usually written as vertical 
arrays, surrounded by square or curved brackets, as in 


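\[
\begin{bmatrix} -1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{bmatrix}
\quad \text{or} \quad
\begin{pmatrix} -1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{pmatrix}.
\]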


They can also be written as numbers separated by commas and surrounded by 
parentheses. In this notation style, the vector above is written as 


(-1.1, 0.0, 3.6, -7.2).


The elements (or entries, coefficients, components) of a vector are the values in the 
array. The size (also called dimension or length) of the vector is the number of 
elements it contains. The vector above, for example, has size four; its third entry 
is 3.6. A vector of size n is called an n-vector. A 1-vector is considered to be the
same as a number, i.e., we do not distinguish between the 1-vector [ 1.3 ] and the
number 1.3.

We often use symbols to denote vectors. If we denote an n-vector using the
symbol a, the ith element of the vector a is denoted $a_i$, where the subscript i is an
integer index that runs from 1 to n, the size of the vector.

Two vectors a and b are equal, which we denote a = b, if they have the same
size, and each of the corresponding entries is the same. If a and b are n-vectors,
then a = b means $a_1 = b_1$, ..., $a_n = b_n$.

The numbers or values of the elements in a vector are called scalars. We will
focus on the case that arises in most applications, where the scalars are real
numbers. In this case we refer to vectors as real vectors. (Occasionally other types of
scalars arise, for example, complex numbers, in which case we refer to the vector
as a complex vector.) The set of all real numbers is written as R, and the set of all
real n-vectors is denoted $R^n$, so $a \in R^n$ is another way to say that a is an n-vector
with real entries. Here we use set notation: $a \in R^n$ means that a is an element of
the set $R^n$; see appendix A.
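
The definitions above can be tried out on a computer. The following is a minimal sketch, assuming we model a real vector as a plain Python list of floats; the helper name vectors_equal is ours and is not part of any library. Python counts list positions from 0, so the third entry of the vector is stored at position 2.

```python
# A vector modeled as a plain Python list of floats (an assumption of this sketch).
a = [-1.1, 0.0, 3.6, -7.2]

size = len(a)        # the size (dimension, length) of the vector
third_entry = a[2]   # the text's third entry; Python counts from 0


def vectors_equal(a, b):
    """Two vectors are equal if they have the same size and equal entries."""
    return len(a) == len(b) and all(ai == bi for ai, bi in zip(a, b))


print(size, third_entry)
print(vectors_equal(a, [-1.1, 0.0, 3.6, -7.2]))
print(vectors_equal(a, [-1.1, 0.0, 3.6]))
```

The last call compares a 4-vector with a 3-vector, so the size check alone makes them unequal.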


Block or stacked vectors. It is sometimes useful to define vectors by concatenating
or stacking two or more vectors, as in

\[
a = \begin{bmatrix} b \\ c \\ d \end{bmatrix},
\]

where a, b, c, and d are vectors. If b is an m-vector, c is an n-vector, and d is a
p-vector, this defines the (m + n + p)-vector

\[
a = (b_1, b_2, \ldots, b_m, c_1, c_2, \ldots, c_n, d_1, d_2, \ldots, d_p).
\]

The stacked vector a is also written as a = (b, c, d).

Stacked vectors can include scalars (numbers). For example if a is a 3-vector,
(1, a) is the 4-vector $(1, a_1, a_2, a_3)$.
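
Stacking corresponds to list concatenation. The sketch below is one possible way to write it, assuming vectors are Python lists; the function name stack and the numerical values are made up for illustration.

```python
def stack(*parts):
    """Concatenate vectors (lists) and scalars (numbers) into one vector."""
    stacked = []
    for part in parts:
        if isinstance(part, (int, float)):
            stacked.append(part)   # a scalar is treated as a 1-vector
        else:
            stacked.extend(part)
    return stacked


b = [1.0, 2.0]          # m = 2
c = [3.0]               # n = 1
d = [4.0, 5.0, 6.0]     # p = 3
a = stack(b, c, d)      # the (m + n + p)-vector (b, c, d)

one_a = stack(1, [7.0, 8.0, 9.0])   # a scalar stacked on top of a 3-vector
print(a, one_a)
```

The size of the result is the sum of the sizes of the parts, here 2 + 1 + 3 = 6.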


Subvectors. In the equation above, we say that b, c, and d are subvectors or
slices of a, with sizes m, n, and p, respectively. Colon notation is used to denote
subvectors. If a is a vector, then $a_{r:s}$ is the vector of size s - r + 1, with entries
$a_r, \ldots, a_s$:

\[
a_{r:s} = (a_r, \ldots, a_s).
\]

The subscript r:s is called the index range. Thus, in our example above, we have

\[
b = a_{1:m}, \qquad c = a_{(m+1):(m+n)}, \qquad d = a_{(m+n+1):(m+n+p)}.
\]

As a more concrete example, if z is the 4-vector (1, -1, 2, 0), the slice $z_{2:3}$ is
$z_{2:3} = (-1, 2)$. Colon notation is not completely standard, but it is growing in popularity.
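
Many programming languages offer a similar slicing notation, but with different index conventions. As a hedged sketch, the hypothetical helper slice_1based below translates the index range r:s of the text (counted from 1, both ends included) into a Python list slice (counted from 0, end excluded).

```python
def slice_1based(a, r, s):
    """Return the subvector with entries r through s, counted from 1."""
    if not (1 <= r <= s <= len(a)):
        raise ValueError("index range must satisfy 1 <= r <= s <= len(a)")
    return a[r - 1:s]   # Python slices start at 0 and exclude the end


z = [1, -1, 2, 0]
z_23 = slice_1based(z, 2, 3)
print(z_23)   # [-1, 2]
```

The returned list has s - r + 1 entries, matching the size given in the text.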


Notational conventions. Some authors try to use notation that helps the reader 
distinguish between vectors and scalars (numbers). For example, Greek letters 
($\alpha$, $\beta$, ...) might be used for numbers, and lower-case letters (a, x, f, ...) for
vectors. Other notational conventions include vectors given in bold font ($\mathbf{g}$), or
vectors written with arrows above them ($\vec{a}$). These notational conventions are not
standardized, so you should be prepared to figure out what things are (i.e., scalars 
or vectors) despite the author’s notational scheme (if any exists). 

Indexing. We should give a couple of warnings concerning the subscripted index
notation $a_i$. The first warning concerns the range of the index. In many computer
languages, arrays of length n are indexed from i = 0 to i = n - 1. But in standard
mathematical notation, n-vectors are indexed from i = 1 to i = n, so in this book,
vectors will be indexed from i = 1 to i = n.

The next warning concerns an ambiguity in the notation $a_i$, used for the ith
element of a vector a. The same notation will occasionally refer to the ith vector
in a collection or list of k vectors $a_1, \ldots, a_k$. Whether $a_3$ means the third element
of a vector a (in which case $a_3$ is a number), or the third vector in some list of
vectors (in which case $a_3$ is a vector) should be clear from the context. When we
need to refer to an element of a vector that is in an indexed collection of vectors,
we can write $(a_i)_j$ to refer to the jth entry of $a_i$, the ith vector in our list.
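
Both warnings show up directly in code. In the sketch below (our own illustration, with arbitrary values), a collection of k = 3 vectors is stored as a list of lists, and the hypothetical helper entry converts the 1-based indices of the text to the 0-based indices that Python uses.

```python
# A list of k = 3 vectors, each of size 2 (values chosen arbitrarily).
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def entry(vectors, i, j):
    """Return the jth entry of the ith vector, with i and j counted from 1."""
    return vectors[i - 1][j - 1]


a_3 = vectors[2]                # the third vector in the list: a vector
a_3_1 = entry(vectors, 3, 1)    # the first entry of the third vector: a number
print(a_3, a_3_1)
```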


Zero vectors. A zero vector is a vector with all elements equal to zero. Sometimes 
the zero vector of size n is written as $0_n$, where the subscript denotes the size.
But usually a zero vector is denoted just 0, the same symbol used to denote the 
number 0. In this case you have to figure out the size of the zero vector from the 
context. As a simple example, if a is a 9-vector, and we are told that a = 0, the 0 
vector on the right-hand side must be the one of size 9. 

Even though zero vectors of different sizes are different vectors, we use the same 
symbol 0 to denote them. In computer programming this is called overloading: 
The symbol 0 is overloaded because it can mean different things depending on the 
context (e.g., the equation it appears in). 


Unit vectors. A (standard) unit vector is a vector with all elements equal to zero,
except one element which is equal to one. The ith unit vector (of size n) is the
unit vector with ith element one, and denoted $e_i$. For example, the vectors

\[
e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad
e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad
e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
\]

are the three unit vectors of size 3. The notation for unit vectors is an example of
the ambiguity in notation noted above. Here, $e_i$ denotes the ith unit vector, and
not the ith element of a vector e. Thus we can describe the ith unit n-vector $e_i$ as

\[
(e_i)_j = \begin{cases} 1 & j = i \\ 0 & j \neq i, \end{cases}
\]

for i, j = 1, ..., n. On the left-hand side $e_i$ is an n-vector; $(e_i)_j$ is a number, its jth
entry. As with zero vectors, the size of $e_i$ is usually determined from the context.


Ones vector. We use the notation $1_n$ for the n-vector with all its elements equal
to one. We also write 1 if the size of the vector can be determined from the 
context. (Some authors use e to denote a vector of all ones, but we will not use 
this notation.) The vector 1 is sometimes called the ones vector. 
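
In a program the size of a zero, unit, or ones vector cannot be left to context; it has to be given explicitly. A minimal sketch, assuming vectors are Python lists and using function names of our own choosing:

```python
def zeros(n):
    """The zero vector of size n."""
    return [0.0] * n


def ones(n):
    """The ones vector of size n."""
    return [1.0] * n


def unit_vector(i, n):
    """The ith unit vector of size n, with i counted from 1 as in the text."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    e = zeros(n)
    e[i - 1] = 1.0
    return e


e1, e2, e3 = (unit_vector(i, 3) for i in (1, 2, 3))
print(e1, e2, e3)
```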

Figure 1.1 Left. The 2-vector x specifies the position (shown as a dot)
with coordinates $x_1$ and $x_2$ in a plane. Right. The 2-vector x represents a
displacement in the plane (shown as an arrow) by $x_1$ in the first axis and $x_2$
in the second.


Sparsity. A vector is said to be sparse if many of its entries are zero; its sparsity 
pattern is the set of indices of nonzero entries. The number of the nonzero entries 
of an n-vector x is denoted nnz(x). Unit vectors are sparse, since they have only 
one nonzero entry. The zero vector is the sparsest possible vector, since it has no 
nonzero entries. Sparse vectors arise in many applications. 
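
The quantities in this paragraph are simple to compute. The sketch below is illustrative only; the function names nnz and sparsity_pattern are our own, and indices are reported starting from 1 to match the text.

```python
def nnz(x):
    """Number of nonzero entries of the vector x."""
    return sum(1 for xi in x if xi != 0)


def sparsity_pattern(x):
    """Set of indices (counted from 1) of the nonzero entries of x."""
    return {i for i, xi in enumerate(x, start=1) if xi != 0}


x = [0, 2.5, 0, -1, 0, 0]
print(nnz(x), sparsity_pattern(x))
```

A unit vector gives nnz equal to one, and a zero vector gives nnz equal to zero with an empty sparsity pattern.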


Examples 


An n-vector can be used to represent n quantities or values in an application. In 
some cases the values are similar in nature (for example, they are given in the same 
physical units); in others, the quantities represented by the entries of the vector are 
quite different from each other. We briefly describe below some typical examples, 
many of which we will see throughout the book. 


Location and displacement. A 2-vector can be used to represent a position or 
location in a 2-dimensional (2-D) space, i.e., a plane, as shown in figure 1.1. A 
3-vector is used to represent a location or position of some point in 3-dimensional 
(3-D) space. The entries of the vector give the coordinates of the position or 
location. 

A vector can also be used to represent a displacement in a plane or 3-D space, 
in which case it is typically drawn as an arrow, as shown in figure 1.1. A vector can 
also be used to represent the velocity or acceleration, at a given time, of a point 
that moves in a plane or 3-D space. 


Color. A 3-vector can represent a color, with its entries giving the Red, Green, 
and Blue (RGB) intensity values (often between 0 and 1). The vector (0,0, 0) 
represents black, the vector (0,1,0) represents a bright pure green color, and the 
vector (1,0.5,0.5) represents a shade of pink. This is illustrated in figure 1.2. 

Figure 1.2 Six colors and their RGB vectors: (1, 0, 0), (0, 1, 0), (0, 0, 1),
(1, 1, 0), (1, 0.5, 0.5), and (0.5, 0.5, 0.5).


Quantities. An n-vector q can represent the amounts or quantities of n different 
resources or products held (or produced, or required) by an entity such as a com- 
pany. Negative entries mean an amount of the resource owed to another party (or 
consumed, or to be disposed of). For example, a bill of materials is a vector that 
gives the amounts of n resources required to create a product or carry out a task. 


Portfolio. An n-vector s can represent a stock portfolio or investment in n different
assets, with $s_i$ giving the number of shares of asset i held. The vector
(100, 50, 20) represents a portfolio consisting of 100 shares of asset 1, 50 shares of 
asset 2, and 20 shares of asset 3. Short positions (i.e., shares that you owe another 
party) are represented by negative entries in a portfolio vector. The entries of the 
portfolio vector can also be given in dollar values, or fractions of the total dollar 
amount invested. 


Values across a population. An n-vector can give the values of some quantity 
across a population of individuals or entities. For example, an n-vector b can 
give the blood pressure of a collection of n patients, with $b_i$ the blood pressure of
patient i, for i = 1, ..., n.


Proportions. A vector w can be used to give fractions or proportions out of n 
choices, outcomes, or options, with $w_i$ the fraction with choice or outcome i. In
this case the entries are nonnegative and add up to one. Such vectors can also be 
interpreted as the recipes for a mixture of n items, an allocation across n entities, 
or as probability values in a probability space with n outcomes. For example, a 
uniform mixture of 4 outcomes is represented as the 4-vector (1/4, 1/4, 1/4, 1/4). 


Time series. An n-vector can represent a time series or signal, that is, the value
of some quantity at different times. (The entries in a vector that represents a time
series are sometimes called samples, especially when the quantity is something
measured.) An audio (sound) signal can be represented as a vector whose entries
give the value of acoustic pressure at equally spaced times (typically 48000 or 44100
per second). A vector might give the hourly rainfall (or temperature, or barometric
pressure) at some location, over some time period. When a vector represents a time
series, it is natural to plot $x_i$ versus i with lines connecting consecutive time series
values. (These lines carry no information; they are added only to make the plot
easier to understand visually.) An example is shown in figure 1.3, where the 48-vector
x gives the hourly temperature in downtown Los Angeles over two days.

Figure 1.3 Hourly temperature in downtown Los Angeles on August 5 and
6, 2015 (starting at 12:47AM, ending at 11:47PM).


Daily return. A vector can represent the daily return of a stock, i.e., its fractional 
increase (or decrease if negative) in value over the day. For example the return time 
series vector (-0.022, +0.014, +0.004) means the stock price went down 2.2% on
the first day, then up 1.4% the next day, and up again 0.4% on the third day. In 
this example, the samples are not uniformly spaced in time; the index refers to 
trading days, and does not include weekends or market holidays. A vector can 
represent the daily (or quarterly, hourly, or minute-by-minute) value of any other 
quantity of interest for an asset, such as price or volume. 


Cash flow. A cash flow into and out of an entity (say, a company) can be repre- 
sented by a vector, with positive entries representing payments to the entity, and 
negative entries representing payments by the entity. For example, with entries 
giving cash flow each quarter, the vector (1000, -10, -10, -10, -1010) represents
a one-year loan of $1000, with 1% interest-only payments made each quarter, and
the principal and last interest payment at the end.
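
The loan example can be generated for other terms. The following sketch is our own illustration; the function name interest_only_loan and its parameters are hypothetical. It builds the cash flow vector seen by the borrower, who receives the principal first, pays interest each period, and repays the principal together with the last interest payment.

```python
def interest_only_loan(principal, rate_per_period, periods):
    """Cash flow vector of an interest-only loan, from the borrower's side."""
    interest = principal * rate_per_period
    flow = [principal] + [-interest] * periods
    flow[-1] -= principal   # principal is repaid with the last interest payment
    return flow


loan = interest_only_loan(1000, 0.01, 4)
print(loan)
```

With a principal of 1000, a rate of 1% per quarter, and four quarters, this reproduces the 5-vector given in the text.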

Figure 1.4 8 × 8 image and the grayscale levels at six pixels
(0.65, 0.05, 0.20, 0.28, 0.00, and 0.90).


Images. A monochrome (black and white) image is an array of M × N pixels
(square patches with uniform grayscale level) with M rows and N columns. Each
of the MN pixels has a grayscale or intensity value, with 0 corresponding to black
and 1 corresponding to bright white. (Other ranges are also used.) An image can
be represented by a vector of length MN, with the elements giving grayscale levels
at the pixel locations, typically ordered column-wise or row-wise.

Figure 1.4 shows a simple example, an 8 × 8 image. (This is a very low resolution;
typical values of M and N are in the hundreds or thousands.) With the vector
entries arranged row-wise, the associated 64-vector is


x = (0.65, 0.05, 0.20, . . . , 0.28, 0.00, 0.90). 


A color M × N pixel image is described by a vector of length 3MN, with the
entries giving the R, G, and B values for each pixel, in some agreed-upon order.
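
The two orderings mentioned above can be made concrete. In this sketch we assume, for illustration only, that a monochrome image is stored as a list of M rows, each a list of N grayscale values; the function names are ours.

```python
def flatten_row_wise(image):
    """Stack the rows of an M x N image into a vector of length M*N."""
    return [pixel for row in image for pixel in row]


def flatten_column_wise(image):
    """Stack the columns of an M x N image into a vector of length M*N."""
    n_columns = len(image[0])
    return [row[j] for j in range(n_columns) for row in image]


image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]   # M = 2 rows, N = 3 columns

print(flatten_row_wise(image))
print(flatten_column_wise(image))
```

Both vectors contain the same MN values; only the agreed-upon order differs, which is why the ordering must be recorded alongside the vector.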


Video. A monochrome video, i.e., a sequence of length K of images with M × N
pixels, can be represented by a vector of length KMN (again, in some particular
order).


Word count and histogram. A vector of length n can represent the number of 
times each word in a dictionary of n words appears in a document. For example, 
(25,2,0) means that the first dictionary word appears 25 times, the second one 
twice, and the third one not at all. (Typical dictionaries used for document word 
counts have many more than 3 elements.) A small example is shown in figure 1.5. A 
variation is to have the entries of the vector give the histogram of word frequencies 
in the document, so that, e.g., $x_5 = 0.003$ means that 0.3% of all the words in the
document are the fifth word in the dictionary. 

    Word count vectors are used in computer based document analysis.
    Each entry of the word count vector is the number of times the
    associated dictionary word appears in the document.

    word        3
    in          2
    number      1
    horse       0
    the         4
    document    2

Figure 1.5 A snippet of text (top), the dictionary (bottom left), and word
count vector (bottom right).

It is common practice to count variations of a word (say, the same word stem
with different endings) as the same word; for example, ‘rain’, ‘rains’, ‘raining’, and
‘rained’ might all be counted as ‘rain’. Reducing each word to its stem is called
stemming. It is also common practice to exclude words that are too common (such
as ‘a’ or ‘the’), which are referred to as stop words, as well as words that are
extremely rare.
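
A word count vector can be computed with a few lines of code. The sketch below is illustrative only: the function name word_count_vector and the tokenization rule (convert to lower case, keep runs of letters) are our assumptions, and no stemming or stop word removal is done. It uses the text snippet and dictionary of figure 1.5.

```python
import re


def word_count_vector(text, dictionary):
    """Count how often each dictionary word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(w) for w in dictionary]


snippet = ("Word count vectors are used in computer based document analysis. "
           "Each entry of the word count vector is the number of times the "
           "associated dictionary word appears in the document.")
dictionary = ["word", "in", "number", "horse", "the", "document"]

counts = word_count_vector(snippet, dictionary)
print(counts)   # [3, 2, 1, 0, 4, 2]
```

Because there is no stemming here, 'vectors' and 'vector' would be counted as different words if either were in the dictionary.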


Customer purchases. An n-vector p can be used to record a particular customer’s
purchases from some business over some period of time, with $p_i$ the quantity of
item i the customer has purchased, for i = 1, ..., n. (Unless n is small, we would
expect many of these entries to be zero, meaning the customer has not purchased
those items.) In one variation, $p_i$ represents the total dollar value of item i the
customer has purchased.


Occurrence or subsets. An n-vector o can be used to record whether or not
each of n different events has occurred, with $o_i = 0$ meaning that event i did not
occur, and $o_i = 1$ meaning that it did occur. Such a vector encodes a subset of
a collection of n objects, with $o_i = 1$ meaning that object i is contained in the
subset, and $o_i = 0$ meaning that object i is not in the subset. Each entry of the
vector o is either 0 or 1; such vectors are called Boolean, after the mathematician
George Boole, a pioneer in the study of logic.
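
The correspondence between subsets and Boolean vectors can be written out in both directions. This is a minimal sketch under our own conventions: objects are labeled 1, ..., n, subsets are Python sets, and the two function names are hypothetical.

```python
def subset_to_boolean(subset, n):
    """Encode a subset of {1, ..., n} as a Boolean n-vector."""
    return [1 if i in subset else 0 for i in range(1, n + 1)]


def boolean_to_subset(o):
    """Recover the subset (labels counted from 1) from a Boolean vector."""
    return {i for i, oi in enumerate(o, start=1) if oi == 1}


o = subset_to_boolean({2, 5}, 5)
print(o, boolean_to_subset(o))
```

The number of ones in the vector equals the number of objects in the subset.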


Features or attributes. In many applications a vector collects together n different 
quantities that pertain to a single thing or object. The quantities can be measure- 
ments, or quantities that can be measured or derived from the object. Such a 
vector is sometimes called a feature vector, and its entries are called the features 
or attributes. For example, a 6-vector f could give the age, height, weight, blood 
pressure, temperature, and gender of a patient admitted to a hospital. (The last 
entry of the vector could be encoded as $f_6 = 0$ for male, $f_6 = 1$ for female.) In this
example, the quantities represented by the entries of the vector are quite different, 
with different physical units. 


Vector entry labels. In applications such as the ones described above, each entry
of a vector has a meaning, such as the count of a specific word in a document, the
number of shares of a specific stock held in a portfolio, or the rainfall in a specific
hour. It is common to keep a separate list of labels or tags that explain or annotate
the meaning of the vector entries. As an example, we might associate the portfolio
vector (100, 50, 20) with the list of ticker symbols (AAPL, INTC, AMZN), so we
know that assets 1, 2, and 3 are Apple, Intel, and Amazon. In some applications,
such as an image, the meaning or ordering of the entries follow known conventions
or standards.


1.2 Vector addition


Two vectors of the same size can be added together by adding the corresponding 
elements, to form another vector of the same size, called the sum of the vectors. 
Vector addition is denoted by the symbol +. (Thus the symbol + is overloaded 
to mean scalar addition when scalars appear on its left- and right-hand sides, and 
vector addition when vectors appear on its left- and right-hand sides.) For example, 


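\[
\begin{bmatrix} 0 \\ 7 \\ 3 \end{bmatrix} +
\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} =
\begin{bmatrix} 1 \\ 9 \\ 3 \end{bmatrix}.
\]

Vector subtraction is similar. As an example,

\[
\begin{bmatrix} 1 \\ 9 \end{bmatrix} -
\begin{bmatrix} 1 \\ 1 \end{bmatrix} =
\begin{bmatrix} 0 \\ 8 \end{bmatrix}.
\]
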
The result of vector subtraction is called the difference of the two vectors. 
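
Entrywise addition and subtraction translate directly into code. The sketch below assumes vectors are Python lists; the function names add and subtract are ours, and the numbers are chosen only for illustration. Note that the built-in + on Python lists concatenates them, so it cannot be used for vector addition.

```python
def add(a, b):
    """Sum of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai + bi for ai, bi in zip(a, b)]


def subtract(a, b):
    """Difference of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai - bi for ai, bi in zip(a, b)]


print(add([0, 7, 3], [1, 2, 0]))   # [1, 9, 3]
print(subtract([1, 9], [1, 1]))    # [0, 8]
```

Raising an error for mismatched sizes mirrors the requirement that only vectors of the same size can be added.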


Properties. Several properties of vector addition are easily verified. For any vec- 
tors a, b, and c of the same size we have the following. 


• Vector addition is commutative: \(a + b = b + a\). 


• Vector addition is associative: \((a + b) + c = a + (b + c)\). We can therefore 
write both as \(a + b + c\). 


• \(a + 0 = 0 + a = a\). Adding the zero vector to a vector has no effect. (This 
is an example where the size of the zero vector follows from the context: It 
must be the same as the size of a.) 


• \(a - a = 0\). Subtracting a vector from itself yields the zero vector. (Here too 
the size of 0 is the size of a.) 


To show that these properties hold, we argue using the definition of vector 
addition and vector equality. As an example, let us show that for any n-vectors a 
and b, we have \(a + b = b + a\). The ith entry of \(a + b\) is, by the definition of vector 
addition, \(a_i + b_i\). The ith entry of \(b + a\) is \(b_i + a_i\). For any two numbers we have 
\(a_i + b_i = b_i + a_i\), so the ith entries of the vectors \(a + b\) and \(b + a\) are the same. 
This is true for all of the entries, so by the definition of vector equality, we have 
\(a + b = b + a\). 


Figure 1.6 Left. The lower blue arrow shows the displacement a; the 
displacement b, shown as the shorter blue arrow, starts from the head of the 
displacement a and ends at the sum displacement \(a + b\), shown as the red 
arrow. Right. The displacement \(b + a\). 

Verifying identities like the ones above, and many others we will encounter 
later, can be tedious. But it is important to understand that the various properties 
we will list can be derived using elementary arguments like the one above. We 
recommend that the reader select a few of the properties we will see, and attempt 
to derive them, just to see that it can be done. (Deriving all of them is overkill.) 
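
As a quick numerical sanity check (which is not a proof), the properties above can be tried on small examples. The following is a minimal Python sketch; the helper names vadd and vsub and the test vectors are our own choices for illustration, and vectors are represented as plain Python lists of integers so that all arithmetic is exact.

```python
def vadd(a, b):
    """Elementwise sum of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai + bi for ai, bi in zip(a, b)]


def vsub(a, b):
    """Elementwise difference of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai - bi for ai, bi in zip(a, b)]


# Illustrative integer vectors (hypothetical data).
a = [0, 7, 3]
b = [1, 2, 0]
c = [4, -1, 5]
zero = [0] * len(a)  # the zero vector takes its size from a

commutative = vadd(a, b) == vadd(b, a)
associative = vadd(vadd(a, b), c) == vadd(a, vadd(b, c))
zero_has_no_effect = vadd(a, zero) == a and vadd(zero, a) == a
self_difference_is_zero = vsub(a, a) == zero

print(vadd(a, b))  # [1, 9, 3]
print(commutative, associative, zero_has_no_effect, self_difference_is_zero)
```

A check like this only confirms the identities for the particular vectors chosen; the entrywise argument given above is what establishes them for all vectors.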


Examples. 


• Displacements. When vectors a and b represent displacements, the sum \(a + b\) 
is the net displacement found by first displacing by a, then displacing by b, 
as shown in figure 1.6. Note that we arrive at the same vector if we first 
displace by b and then a. If the vector p represents a position and the vector 
a represents a displacement, then \(p + a\) is the position of the point p, displaced 
by a, as shown in figure 1.7. 


• Displacements between two points. If the vectors p and q represent the 
positions of two points in 2-D or 3-D space, then \(p - q\) is the displacement vector 
from q to p, as illustrated in figure 1.8. 


• Word counts. If a and b are word count vectors (using the same dictionary) 
for two documents, the sum \(a + b\) is the word count vector of a new document 
created by combining the original two (in either order). The word count 
difference vector \(a - b\) gives the number of times more each word appears in 
the first document than the second. 


• Bill of materials. Suppose \(q_1, \ldots, q_N\) are n-vectors that give the quantities of 
n different resources required to accomplish N tasks. Then the sum n-vector 
\(q_1 + \cdots + q_N\) gives the bill of materials for completing all N tasks. 




Figure 1.7 The vector p + a is the position of the point represented by p 
displaced by the displacement represented by a. 


Figure 1.8 The vector \(p - q\) represents the displacement from the point 
represented by q to the point represented by p. 


• Market clearing. Suppose the n-vector \(q_i\) represents the quantities of n 
goods or resources produced (when positive) or consumed (when negative) 
by agent i, for \(i = 1, \ldots, N\), so \((q_5)_4 = -3.2\) means that agent 5 consumes 
3.2 units of resource 4. The sum \(s = q_1 + \cdots + q_N\) is the n-vector of total net 
surplus of the resources (or shortfall, when the entries are negative). When 
\(s = 0\), we have a closed market, which means that the total quantity of each 
resource produced by the agents balances the total quantity consumed. In 
other words, the n resources are exchanged among the agents. In this case 
we say that the market clears (with the resource vectors \(q_1, \ldots, q_N\)). 


• Audio addition. When a and b are vectors representing audio signals over 
the same period of time, the sum \(a + b\) is an audio signal that is perceived as 
containing both audio signals combined into one. If a represents a recording of 
a voice, and b a recording of music (of the same length), the audio signal \(a + b\) 
will be perceived as containing both the voice recording and, simultaneously, 
the music. 


• Feature differences. If f and g are n-vectors that give n feature values for two 
items, the difference vector \(d = f - g\) gives the difference in feature values for 
the two objects. For example, \(d_7 = 0\) means that the two objects have the 
same value for feature 7; \(d_3 = 1.67\) means that the first object’s third feature 
value exceeds the second object’s third feature value by 1.67. 


• Time series. If a and b represent time series of the same quantity, such as 
daily profit at two different stores, then \(a + b\) represents a time series which is 
the total daily profit at the two stores. An example (with monthly rainfall) 
is shown in figure 1.9. 


• Portfolio trading. Suppose s is an n-vector giving the number of shares of n 
assets in a portfolio, and b is an n-vector giving the number of shares of the 
assets that we buy (when \(b_i\) is positive) or sell (when \(b_i\) is negative). After 
the asset purchases and sales, our portfolio is given by \(s + b\), the sum of the 
original portfolio vector and the purchase vector b, which is also called the 
trade vector or trade list. (The same interpretation works when the portfolio 
and trade vectors are given in dollar value.) 
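
Two of the examples above, market clearing and portfolio trading, can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper names vadd and vsum, the three agents, the two resources, and the share counts are all hypothetical and do not come from any real data.

```python
def vadd(a, b):
    """Elementwise sum of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai + bi for ai, bi in zip(a, b)]


def vsum(vectors):
    """Sum of a nonempty list of vectors of the same size."""
    total = [0] * len(vectors[0])
    for v in vectors:
        total = vadd(total, v)
    return total


# Hypothetical market with N = 3 agents and n = 2 resources.
# Positive entries are production, negative entries are consumption.
q = [[5, -2], [-3, 4], [-2, -2]]
s = vsum(q)  # net surplus of each resource
market_clears = all(si == 0 for si in s)
print(s, market_clears)  # [0, 0] True

# Hypothetical portfolio of n = 3 assets (shares) and a trade vector.
holdings = [100, 50, 20]
trade = [-20, 0, 30]  # sell 20 of asset 1, buy 30 of asset 3
new_holdings = vadd(holdings, trade)
print(new_holdings)  # [80, 50, 50]
```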


Addition notation in computer languages. Some computer languages for manip- 
ulating vectors define the sum of a vector and a scalar as the vector obtained by 
adding the scalar to each element of the vector. This is not standard mathematical 
notation, however, so we will not use it. Even more confusing, in some computer 
languages the plus symbol is used to denote concatenation of arrays, which means 
putting one array after another, as in (1,2) + (3,4,5) = (1,2,3,4,5). While this 
notation might give a valid expression in some computer languages, it is not stan- 
dard mathematical notation, and we will not use it in this book. In general, it is 
very important to distinguish between mathematical notation for vectors (which 
we use) and the syntax of specific computer languages or software packages for 
manipulating vectors. 
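
Python's built-in lists are one concrete instance of the second point. The sketch below shows that the plus symbol concatenates lists, and that elementwise addition has to be written out separately; the helper name vadd is our own, chosen for illustration. (Array libraries typically define + differently again, which is one more reason to keep mathematical notation and language syntax apart.)

```python
x = [1, 2]
y = [3, 4, 5]

# For Python lists, + means concatenation, not vector addition.
concatenated = x + y
print(concatenated)  # [1, 2, 3, 4, 5]


def vadd(a, b):
    """Elementwise (mathematical) sum of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai + bi for ai, bi in zip(a, b)]


print(vadd([1, 2], [3, 4]))  # [4, 6]

# Adding a scalar to a list is not defined for Python lists at all.
try:
    [1, 2] + 3
    scalar_add_allowed = True
except TypeError:
    scalar_add_allowed = False
print(scalar_add_allowed)  # False
```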


Figure 1.9 Average monthly rainfall in inches measured in downtown Los 
Angeles and San Francisco International Airport, and their sum. Averages 
are 30-year averages (1981-2010). 


1.3 Scalar-vector multiplication 


Another operation is scalar multiplication or scalar-vector multiplication, in which 
a vector is multiplied by a scalar (i.e., number), which is done by multiplying 
every element of the vector by the scalar. Scalar multiplication is denoted by 
juxtaposition, typically with the scalar on the left, as in 


\[ (-2) \begin{bmatrix} 1 \\ 9 \\ 6 \end{bmatrix} = \begin{bmatrix} -2 \\ -18 \\ -12 \end{bmatrix}. \]


Scalar-vector multiplication can also be written with the scalar on the right, as in 


\[ \begin{bmatrix} 1 \\ 9 \\ 6 \end{bmatrix} (1.5) = \begin{bmatrix} 1.5 \\ 13.5 \\ 9 \end{bmatrix}. \]


The meaning is the same: It is the vector obtained by multiplying each element 
by the scalar. A similar notation is a/2, where a is a vector, meaning (1/2)a. The 
scalar-vector product \((-1)a\) is written simply as \(-a\). Note that \(0a = 0\) (where the 
left-hand zero is the scalar zero, and the right-hand zero is a vector zero of the 
same size as a). 


Properties. By definition, we have \(\alpha a = a \alpha\), for any scalar \(\alpha\) and any vector a. 
This is called the commutative property of scalar-vector multiplication; it means 
that scalar-vector multiplication can be written in either order. 

Scalar multiplication obeys several other laws that are easy to figure out from 
the definition. For example, it satisfies the associative property: If a is a vector 
and \(\beta\) and \(\gamma\) are scalars, we have 


\[ (\beta\gamma)a = \beta(\gamma a). \]


On the left-hand side we see scalar-scalar multiplication (\(\beta\gamma\)) and scalar-vector 
multiplication; on the right-hand side we see two scalar-vector products. As a 
consequence, we can write the vector above as \(\beta\gamma a\), since it does not matter whether 
we interpret this as \(\beta(\gamma a)\) or \((\beta\gamma)a\). 

The associative property holds also when we denote scalar-vector multiplication 
with the scalar on the right. For example, we have \(\beta(\gamma a) = (\beta a)\gamma\), and consequently 
we can write both as \(\beta a \gamma\). As a convention, however, this vector is normally written 
as \(\beta\gamma a\) or as \((\beta\gamma)a\). 

If a is a vector and \(\beta\), \(\gamma\) are scalars, then 


\[ (\beta + \gamma)a = \beta a + \gamma a. \]


(This is the left-distributive property of scalar-vector multiplication.) Scalar 
multiplication, like ordinary multiplication, has higher precedence in equations than 
vector addition, so the right-hand side here, \(\beta a + \gamma a\), means \((\beta a) + (\gamma a)\). It is useful 
to identify the symbols appearing in this formula above. The + symbol on the left 
is addition of scalars, while the + symbol on the right denotes vector addition. 
When scalar multiplication is written with the scalar on the right, we have the 
right-distributive property: 


\[ a(\beta + \gamma) = a\beta + a\gamma. \]


Scalar-vector multiplication also satisfies another version of the right-distributive 
property: 

\[ \beta(a + b) = \beta a + \beta b \]

for any scalar \(\beta\) and any n-vectors a and b. In this equation, both of the + symbols 
refer to the addition of n-vectors. 
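
The properties of scalar-vector multiplication can also be tried numerically. The following Python sketch is an illustration only: the helper names smul and vadd, the vectors, and the scalars are our own hypothetical choices, and we use exact rational arithmetic (the standard fractions module) so that both sides of each identity agree exactly rather than approximately.

```python
from fractions import Fraction


def smul(beta, a):
    """Scalar-vector product: multiply every entry of a by beta."""
    return [beta * ai for ai in a]


def vadd(a, b):
    """Elementwise sum of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [ai + bi for ai, bi in zip(a, b)]


a = [1, 9, 6]
b = [2, 0, -1]
beta = Fraction(3, 2)
gamma = -2

print(smul(-2, a))  # [-2, -18, -12]

# (beta * gamma) a == beta (gamma a)
associative = smul(beta * gamma, a) == smul(beta, smul(gamma, a))
# (beta + gamma) a == beta a + gamma a
left_distributive = smul(beta + gamma, a) == vadd(smul(beta, a), smul(gamma, a))
# beta (a + b) == beta a + beta b
right_distributive = smul(beta, vadd(a, b)) == vadd(smul(beta, a), smul(beta, b))

print(associative, left_distributive, right_distributive)
```

With floating point scalars instead of Fraction objects, the two sides of each identity could differ by a small rounding error, so exact equality tests would no longer be appropriate.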


Examples. 


• Displacements. When a vector a represents a displacement, and \(\beta > 0\), \(\beta a\) 
is a displacement in the same direction as a, with its magnitude scaled by \(\beta\). 
When \(\beta < 0\), \(\beta a\) represents a displacement in the opposite direction of a, 
with magnitude scaled by \(|\beta|\). This is illustrated in figure 1.10. 


• Materials requirements. Suppose the n-vector q is the bill of materials for 
producing one unit of some product, i.e., \(q_i\) is the amount of raw material i 
required to produce one unit of product. To produce \(\alpha\) units of the product 
will then require raw materials given by \(\alpha q\). (Here we assume that \(\alpha > 0\).) 


• Audio scaling. If a is a vector representing an audio signal, the scalar-vector 
product \(\beta a\) is perceived as the same audio signal, but changed in volume 
(loudness) by the factor \(|\beta|\). For example, when \(\beta = 1/2\) (or \(\beta = -1/2\)), \(\beta a\) 
is perceived as the same audio signal, but quieter. 


Figure 1.10 The vector \(0.75a\) represents the displacement in the direction of 
the displacement a, with magnitude scaled by 0.75; \((-1.5)a\) represents the 
displacement in the opposite direction, with magnitude scaled by 1.5. 


Linear combinations. If \(a_1, \ldots, a_m\) are n-vectors, and \(\beta_1, \ldots, \beta_m\) are scalars, the 
n-vector 

\[ \beta_1 a_1 + \cdots + \beta_m a_m \]

is called a linear combination of the vectors \(a_1, \ldots, a_m\). The scalars \(\beta_1, \ldots, \beta_m\) are 
called the coefficients of the linear combination. 


Linear combination of unit vectors. We can write any n-vector b as a linear 
combination of the standard unit vectors, as 


\[ b = b_1 e_1 + \cdots + b_n e_n. \tag{1.1} \]


In this equation \(b_i\) is the ith entry of b (i.e., a scalar), and \(e_i\) is the ith unit vector. 
In the linear combination of \(e_1, \ldots, e_n\) given in (1.1), the coefficients are the entries 
of the vector b. A specific example is 


\[ \begin{bmatrix} -1 \\ 3 \\ 5 \end{bmatrix} = (-1) \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + 5 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. \]


Special linear combinations. Some linear combinations of the vectors \(a_1, \ldots, a_m\) 
have special names. For example, the linear combination with \(\beta_1 = \cdots = \beta_m = 1\), 
given by \(a_1 + \cdots + a_m\), is the sum of the vectors, and the linear combination with 
\(\beta_1 = \cdots = \beta_m = 1/m\), given by \((1/m)(a_1 + \cdots + a_m)\), is the average of the vectors. 


When the coefficients sum to one, i.e., \(\beta_1 + \cdots + \beta_m = 1\), the linear combination 
is called an affine combination. When the coefficients in an affine combination are 
nonnegative, it is called a convex combination, a mixture, or a weighted average. The 
coefficients in an affine or convex combination are sometimes given as percentages, 
which add up to 100%. 


Figure 1.11 Left. Two 2-vectors \(a_1\) and \(a_2\). Right. The linear combination 
\(b = 0.75 a_1 + 1.5 a_2\). 


Examples. 


• Displacements. When the vectors represent displacements, a linear combination 
is the sum of the scaled displacements. This is illustrated in figure 1.11. 


• Audio mixing. When \(a_1, \ldots, a_m\) are vectors representing audio signals (over 
the same period of time, for example, simultaneously recorded), they are 
called tracks. The linear combination \(\beta_1 a_1 + \cdots + \beta_m a_m\) is perceived as a 
mixture (also called a mix) of the audio tracks, with relative loudness given 
by \(|\beta_1|, \ldots, |\beta_m|\). A producer in a studio, or a sound engineer at a live show, 
chooses values of \(\beta_1, \ldots, \beta_m\) to give a good balance between the different 
instruments, vocals, and drums. 


• Cash flow replication. Suppose that \(c_1, \ldots, c_m\) are vectors that represent cash 
flows, such as particular types of loans or investments. The linear combination 
\(f = \beta_1 c_1 + \cdots + \beta_m c_m\) represents another cash flow. We say that the cash 
flow f has been replicated by the (linear combination of the) original cash 
flows \(c_1, \ldots, c_m\). As an example, \(c_1 = (1, -1.1, 0)\) represents a $1 loan from 
period 1 to period 2 with 10% interest, and \(c_2 = (0, 1, -1.1)\) represents a $1 
loan from period 2 to period 3 with 10% interest. The linear combination 


\[ d = c_1 + 1.1 c_2 = (1, 0, -1.21) \]


represents a two period loan of $1 in period 1, with compounded 10% interest. 
Here we have replicated a two period loan from two one period loans. 


• Line and segment. When a and b are different n-vectors, the affine combination 
\(c = (1 - \theta)a + \theta b\), where \(\theta\) is a scalar, describes a point on the line 
passing through a and b. When \(0 \le \theta \le 1\), c is a convex combination of a 
and b, and is said to lie on the segment between a and b. For n = 2 and 
n = 3, with the vectors representing coordinates of 2-D or 3-D points, this 
agrees with the usual geometric notion of line and segment. But we can also 
talk about the line passing through two vectors of dimension 100. This is 
illustrated in figure 1.12. 
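
The cash flow replication and line segment examples above can be reproduced with a small Python sketch. The function name lincomb is our own, and the points used for the segment are hypothetical; we use exact rational arithmetic (the standard fractions module) so that the 10% interest figures come out exactly.

```python
from fractions import Fraction


def lincomb(coeffs, vectors):
    """Return coeffs[0]*vectors[0] + ... + coeffs[m-1]*vectors[m-1]."""
    if len(coeffs) != len(vectors):
        raise ValueError("need one coefficient per vector")
    n = len(vectors[0])
    result = [0] * n
    for beta, a in zip(coeffs, vectors):
        if len(a) != n:
            raise ValueError("vectors must have the same size")
        for i in range(n):
            result[i] += beta * a[i]
    return result


# Cash flow replication: two one period loans with 10% interest.
c1 = [1, Fraction(-11, 10), 0]
c2 = [0, 1, Fraction(-11, 10)]
d = lincomb([1, Fraction(11, 10)], [c1, c2])
print([float(di) for di in d])  # [1.0, 0.0, -1.21]

# Affine combination (1 - theta) a + theta b for two hypothetical 2-D points.
a = [0, 0]
b = [4, 2]
theta = Fraction(1, 4)  # between 0 and 1, so the point is on the segment
c = lincomb([1 - theta, theta], [a, b])
print(c == [1, Fraction(1, 2)])  # True
```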


Figure 1.12 The affine combination \((1 - \theta)a + \theta b\) for different values of \(\theta\). 
These points are on the line passing through a and b; for \(\theta\) between 0 and 1, 
the points are on the line segment between a and b. 


1.4 Inner product 


The (standard) inner product (also called dot product) of two n-vectors is defined 
as the scalar 

\[ a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n, \]

the sum of the products of corresponding entries. (The origin of the superscript 
‘T’ in the inner product notation \(a^T b\) will be explained in chapter 6.) Some other 
notations for the inner product (that we will not use in this book) are \(\langle a, b \rangle\), \(\langle a | b \rangle\), 
\((a, b)\), and \(a \cdot b\). (In the notation used in this book, \((a, b)\) denotes a stacked vector 
of length 2n.) As you might guess, there is also a vector outer product, which we 
will encounter later, in §10.1. As a specific example of the inner product, we have 


\[ \begin{bmatrix} -1 \\ 2 \\ 2 \end{bmatrix}^T \begin{bmatrix} 1 \\ 0 \\ -3 \end{bmatrix} = (-1)(1) + (2)(0) + (2)(-3) = -7. \]


When n = 1, the inner product reduces to the usual product of two numbers. 


Properties. The inner product satisfies some simple properties that are easily 
verified from the definition. If a, b, and c are vectors of the same size, and \(\gamma\) is a 
scalar, we have the following. 


• Commutativity. \(a^T b = b^T a\). The order of the two vector arguments in the 
inner product does not matter. 


• Associativity with scalar multiplication. \((\gamma a)^T b = \gamma (a^T b)\), so we can write 
both as \(\gamma a^T b\). 


• Distributivity with vector addition. \((a + b)^T c = a^T c + b^T c\). The inner product 
can be distributed across vector addition. 


These can be combined to obtain other identities, such as \(a^T (\gamma b) = \gamma (a^T b)\), or 
\(a^T (b + \gamma c) = a^T b + \gamma a^T c\). As another useful example, we have, for any vectors 
a, b, c, d of the same size, 


\[ (a + b)^T (c + d) = a^T c + a^T d + b^T c + b^T d. \]


This formula expresses an inner product on the left-hand side as a sum of four 
inner products on the right-hand side, and is analogous to expanding a product of 
sums in algebra. Note that on the left-hand side, the two addition symbols refer to 
vector addition, whereas on the right-hand side, the three addition symbols refer 
to scalar (number) addition. 


General examples. 


• Unit vector. \(e_i^T a = a_i\). The inner product of a vector with the ith standard 
unit vector gives (or ‘picks out’) the ith element of a. 


• Sum. \(\mathbf{1}^T a = a_1 + \cdots + a_n\). The inner product of a vector with the vector of 
ones gives the sum of the elements of the vector. 


• Average. \((\mathbf{1}/n)^T a = (a_1 + \cdots + a_n)/n\). The inner product of an n-vector with 
the vector \(\mathbf{1}/n\) gives the average or mean of the elements of the vector. The 
average of the entries of a vector is denoted by avg(x). The Greek letter \(\mu\) 
is a traditional symbol used to denote the average or mean. 


• Sum of squares. \(a^T a = a_1^2 + \cdots + a_n^2\). The inner product of a vector with 
itself gives the sum of the squares of the elements of the vector. 


• Selective sum. Let b be a vector all of whose entries are either 0 or 1. Then 
\(b^T a\) is the sum of the elements in a for which \(b_i = 1\). 
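
The general examples above are easy to try out. Below is a minimal Python sketch; the function name inner and the particular vectors are our own choices for illustration, and the average uses exact rational arithmetic (the standard fractions module) to avoid rounding.

```python
from fractions import Fraction


def inner(a, b):
    """Inner product of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return sum(ai * bi for ai, bi in zip(a, b))


a = [-1, 2, 2]
b = [1, 0, -3]
n = len(a)

print(inner(a, b))  # -7, the specific example given earlier

e2 = [0, 1, 0]                # second standard unit vector
ones = [1] * n                # vector of ones
ones_over_n = [Fraction(1, n)] * n
selector = [1, 0, 1]          # selects the first and third entries

print(inner(e2, a))           # 2, the second entry of a
print(inner(ones, a))         # 3, the sum of the entries
print(inner(ones_over_n, a))  # 1, the average of the entries
print(inner(a, a))            # 9, the sum of squares
print(inner(selector, a))     # 1, the selective sum a_1 + a_3
```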


Block vectors. If the vectors a and b are block vectors, and the corresponding 
blocks have the same sizes (in which case we say they conform), then we have 


\[ a^T b = \begin{bmatrix} a_1 \\ \vdots \\ a_k \end{bmatrix}^T \begin{bmatrix} b_1 \\ \vdots \\ b_k \end{bmatrix} = a_1^T b_1 + \cdots + a_k^T b_k. \]


The inner product of block vectors is the sum of the inner products of the blocks. 


Applications. The inner product is useful in many applications, a few of which 
we list here. 


• Co-occurrence. If a and b are n-vectors that describe occurrence, i.e., each 
of their elements is either 0 or 1, then \(a^T b\) gives the total number of indices 
for which \(a_i\) and \(b_i\) are both one, that is, the total number of co-occurrences. 
If we interpret the vectors a and b as describing subsets of n objects, then 
\(a^T b\) gives the number of objects in the intersection of the two subsets. This 
is illustrated in figure 1.13, for two subsets A and B of 7 objects, labeled 
1,...,7, with corresponding occurrence vectors 


\[ a = (0, 1, 1, 1, 1, 1, 1), \qquad b = (1, 0, 1, 0, 1, 0, 0). \]


Here we have \(a^T b = 2\), which is the number of objects in both A and B (i.e., 
objects 3 and 5). 


Figure 1.13 Two sets A and B, containing seven objects. 


• Weights, features, and score. When the vector f represents a set of features of 
an object, and w is a vector of the same size (often called a weight vector), the 
inner product \(w^T f\) is the sum of the feature values, scaled (or weighted) by 
the weights, and is sometimes called a score. For example, if the features are 
associated with a loan applicant (e.g., age, income, ...), we might interpret 
\(s = w^T f\) as a credit score. In this example we can interpret \(w_i\) as the weight 
given to feature i in forming the score. 


• Price-quantity. If p represents a vector of prices of n goods, and q is a vector 
of quantities of the n goods (say, the bill of materials for a product), then 
their inner product \(p^T q\) is the total cost of the goods given by the vector q. 


• Speed-time. A vehicle travels over n segments with constant speed in each 
segment. Suppose the n-vector s gives the speed in the segments, and the 
n-vector t gives the times taken to traverse the segments. Then \(s^T t\) is the 
total distance traveled. 


• Probability and expected values. Suppose the n-vector p has nonnegative 
entries that sum to one, so it describes a set of proportions among n items, 
or a set of probabilities of n outcomes, one of which must occur. Suppose 
f is another n-vector, where we interpret \(f_i\) as the value of some quantity if 
outcome i occurs. Then \(f^T p\) gives the expected value or mean of the quantity, 
under the probabilities (or fractions) given by p. 


• Polynomial evaluation. Suppose the n-vector c represents the coefficients of 
a polynomial p of degree n - 1 or less: 


\[ p(x) = c_1 + c_2 x + \cdots + c_{n-1} x^{n-2} + c_n x^{n-1}. \]


Let t be a number, and let \(z = (1, t, t^2, \ldots, t^{n-1})\) be the n-vector of powers 
of t. Then \(c^T z = p(t)\), the value of the polynomial p at the point t. So the 
inner product of a polynomial coefficient vector and vector of powers of a 
number evaluates the polynomial at the number. 


• Discounted total. Let c be an n-vector representing a cash flow, with \(c_i\) the 
cash received (when \(c_i > 0\)) in period i. Let d be the n-vector defined as 


\[ d = (1, 1/(1 + r), \ldots, 1/(1 + r)^{n-1}), \]

where \(r > 0\) is an interest rate. Then 

\[ d^T c = c_1 + c_2/(1 + r) + \cdots + c_n/(1 + r)^{n-1} \]


is the discounted total of the cash flow, i.e., its net present value (NPV), with 
interest rate r. 


• Portfolio value. Suppose s is an n-vector representing the holdings in shares of 
a portfolio of n different assets, with negative values meaning short positions. 
If p is an n-vector giving the prices of the assets, then \(p^T s\) is the total (or 
net) value of the portfolio. 


• Portfolio return. Suppose r is the vector of (fractional) returns of n assets 
over some time period, i.e., the asset relative price changes 


\[ r_i = \frac{p_i^{\mathrm{final}} - p_i^{\mathrm{initial}}}{p_i^{\mathrm{initial}}}, \quad i = 1, \ldots, n, \]

where \(p_i^{\mathrm{initial}}\) and \(p_i^{\mathrm{final}}\) are the (positive) prices of asset i at the beginning and 
end of the investment period. If h is an n-vector giving our portfolio, with \(h_i\) 
denoting the dollar value of asset i held, then the inner product \(r^T h\) is the 
total return of the portfolio, in dollars, over the period. If w represents the 
fractional (dollar) holdings of our portfolio, then \(r^T w\) gives the total return 
of the portfolio. For example, if \(r^T w = 0.09\), then our portfolio return is 9%. 
If we had invested $10000 initially, we would have earned $900. 


• Document sentiment analysis. Suppose the n-vector x represents the 
histogram of word occurrences in a document, from a dictionary of n words. 
Each word in the dictionary is assigned to one of three sentiment categories: 
Positive, Negative, and Neutral. The list of positive words might include 
‘nice’ and ‘superb’; the list of negative words might include ‘bad’ and 
‘terrible’. Neutral words are those that are neither positive nor negative. We 
encode the word categories as an n-vector w, with \(w_i = 1\) if word i is 
positive, with \(w_i = -1\) if word i is negative, and \(w_i = 0\) if word i is neutral. The 
number \(w^T x\) gives a (crude) measure of the sentiment of the document. 
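
Three of the applications listed above (co-occurrence, polynomial evaluation, and discounted total) are sketched below in Python. The function name inner is our own; the co-occurrence vectors are the ones from the example above, while the polynomial, the cash flow, and the interest rate are hypothetical values chosen so the arithmetic is easy to follow. Exact rational arithmetic (the standard fractions module) is used for the discounting.

```python
from fractions import Fraction


def inner(a, b):
    """Inner product of two vectors of the same size."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return sum(ai * bi for ai, bi in zip(a, b))


# Co-occurrence: occurrence vectors of the subsets A and B.
a = [0, 1, 1, 1, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 0]
co_occurrences = inner(a, b)
print(co_occurrences)  # 2

# Polynomial evaluation: hypothetical p(x) = 2 + 0*x + 1*x^2, evaluated at t = 3.
c = [2, 0, 1]
t = 3
z = [t ** k for k in range(len(c))]  # vector of powers (1, t, t^2)
p_of_t = inner(c, z)
print(p_of_t)  # 11

# Discounted total (NPV): hypothetical cash flow with interest rate r = 10%.
cash_flow = [-100, 60, 60]
r = Fraction(1, 10)
d = [1 / (1 + r) ** k for k in range(len(cash_flow))]  # (1, 1/(1+r), 1/(1+r)^2)
npv = inner(cash_flow, d)
print(npv)  # 500/121
```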


1.5 Complexity of vector computations 


Computer representation of numbers and vectors. Real numbers are stored in 
computers using floating point format, which represents a real number using a 
block of 64 bits (0s and 1s), or 8 bytes (groups of 8 bits). Each of the \(2^{64}\) possible 
sequences of bits corresponds to a specific real number. The floating point numbers 
span a very wide range of values, and are very closely spaced, so numbers that 
arise in applications can be approximated as floating point numbers to an accuracy 
of around 10 digits, which is good enough for almost all practical applications. 
Integers are stored in a more compact format, and are represented exactly. 

Vectors are stored as arrays of floating point numbers (or integers, when the 
entries are all integers). Storing an n-vector requires 8n bytes. Current 
memory and storage devices, with capacities measured in many gigabytes (\(10^9\) 
bytes), can easily store vectors with dimensions in the millions or billions. Sparse 
vectors are stored in a more efficient way that keeps track of indices and values of 
the nonzero entries. 
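
The 8n byte figure can be checked in Python using the standard array module, which stores numbers in a packed block of memory. This is a sketch under the assumption that the platform uses 8-byte doubles, which is the case on essentially all current machines; the dictionary used for the sparse vector is just one simple storage scheme of our own choosing, not a description of any particular software package.

```python
from array import array

n = 1000
x = array("d", [0.0] * n)  # type code "d" means double precision floating point

bytes_per_entry = x.itemsize
total_bytes = n * bytes_per_entry
print(bytes_per_entry, total_bytes)  # 8 8000 on typical platforms

# A sparse n-vector stored as {index: value} for its nonzero entries only.
sparse_x = {3: 1.5, 998: -2.0}
nnz = len(sparse_x)  # number of nonzero entries
print(nnz)  # 2
```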


Floating point operations. When computers carry out addition, subtraction, 
multiplication, division, or other arithmetic operations on numbers represented 
in floating point format, the result is rounded to the nearest floating point number. 
These operations are called floating point operations. The very small error in the 
computed result is called (floating point) round-off error. In most applications, 
these very small errors have no practical effect. Floating point round-off errors, 
and methods to mitigate their effect, are studied in a field called numerical analy- 
sis. In this book we will not consider floating point round-off error, but you should 
be aware that it exists. For example, when a computer evaluates the left-hand and 
right-hand sides of a mathematical identity, you should not be surprised if the two 
numbers are not equal. They should, however, be very close. 
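
A standard illustration of this effect is the associative law for addition of three numbers, which holds exactly in mathematics but not always in floating point arithmetic. The Python sketch below uses ordinary double precision numbers; the particular values 0.1, 0.2, and 0.3 are a common textbook choice, and math.isclose is one reasonable way (not the only one) to test for approximate equality.

```python
import math

lhs = (0.1 + 0.2) + 0.3
rhs = 0.1 + (0.2 + 0.3)

exactly_equal = lhs == rhs
nearly_equal = math.isclose(lhs, rhs)

print(exactly_equal)  # False: the two sides differ by round-off error
print(nearly_equal)   # True: the difference is tiny
print(abs(lhs - rhs) < 1e-15)  # True
```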


Flop counts and complexity. So far we have seen only a few vector operations, like 
scalar multiplication, vector addition, and the inner product. How quickly these 
operations can be carried out by a computer depends very much on the computer 
hardware and software, and the size of the vector. 

A very rough estimate of the time required to carry out some computation, such 
as an inner product, can be found by counting the total number of floating point 
operations, or FLOPs. This term is in such common use that the acronym is now 
written in lower case letters, as flops, and the speed with which a computer can 
carry out flops is expressed in Gflop/s (gigaflops per second, i.e., billions of flops 
per second). Typical current values are in the range of 1-10 Gflop/s, but this can 
vary by several orders of magnitude. The actual time it takes a computer to carry 
out some computation depends on many other factors beyond the total number of 
flops required, so time estimates based on counting flops are very crude, and are 
not meant to be more accurate than a factor of ten or so. For this reason, gross 
approximations (such as ignoring a factor of 2) can be used when counting the flops 
required in a computation. 

The complexity of an operation is the number of flops required to carry it out, as 
a function of the size or sizes of the input to the operation. Usually the complexity 
is highly simplified, dropping terms that are small or negligible (compared to other 
terms) when the sizes of the inputs are large. In theoretical computer science, the 
term ‘complexity’ is used in a different way, to mean the number of flops of the best 
method to carry out the computation, i.e., the one that requires the fewest flops. 
In this book, we use the term complexity to mean the number of flops required by 
a specific method. 


Complexity of vector operations. Scalar-vector multiplication \(\alpha x\), where x is an 
n-vector, requires n multiplications, i.e., \(\alpha x_i\) for \(i = 1, \ldots, n\). Vector addition \(x + y\) 
of two n-vectors takes n additions, i.e., \(x_i + y_i\) for \(i = 1, \ldots, n\). Computing the 
inner product \(x^T y = x_1 y_1 + \cdots + x_n y_n\) of two n-vectors takes \(2n - 1\) flops, n scalar 
multiplications and \(n - 1\) scalar additions. So scalar multiplication, vector addition, 
and the inner product of n-vectors require n, n, and \(2n - 1\) flops, respectively. 
We only need an estimate, so we simplify the last to 2n flops, and say that the 
complexity of scalar multiplication, vector addition, and the inner product of 
n-vectors is n, n, and 2n flops, respectively. We can guess that a 1 Gflop/s computer 
can compute the inner product of two vectors of size one million in around one 
thousandth of a second, but we should not be surprised if the actual time differs 
by a factor of 10 from this value. 

The order of the computation is obtained by ignoring any constant that multi- 
plies a power of the dimension. So we say that the three vector operations scalar 
multiplication, vector addition, and inner product have order n. Ignoring the fac- 
tor of 2 dropped in the actual complexity of the inner product is reasonable, since 
we do not expect flop counts to predict the running time with an accuracy better 
than a factor of 2. The order is useful in understanding how the time to execute 
the computation will scale when the size of the operands changes. An order n 
computation should take around 10 times longer to carry out its computation on 
an input that is 10 times bigger. 
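
The flop counts and the resulting crude time estimate can be written down directly. In the Python sketch below the function names and the default speed of 1 Gflop/s are our own assumptions for illustration. Using the exact count of 2n - 1 flops, the estimate for an inner product of two vectors of size one million comes out at about two thousandths of a second, which agrees with the rough figure quoted above to within the factor of 2 that we have agreed to ignore.

```python
def flops_scalar_multiplication(n):
    """Flops to multiply an n-vector by a scalar."""
    return n


def flops_vector_addition(n):
    """Flops to add two n-vectors."""
    return n


def flops_inner_product(n):
    """Flops for the inner product: n multiplications and n - 1 additions."""
    return 2 * n - 1


def estimated_seconds(flops, gflops_per_second=1.0):
    """Very crude running time estimate from a flop count."""
    return flops / (gflops_per_second * 1e9)


n = 10**6
t_inner = estimated_seconds(flops_inner_product(n))
print(t_inner)  # roughly 0.002 seconds

# Order n: a 10 times bigger input needs about 10 times as many flops.
scaling = flops_inner_product(10 * n) / flops_inner_product(n)
print(scaling)  # close to 10
```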


Complexity of sparse vector operations. If x is sparse, then computing \(\alpha x\) 
requires nnz(x) flops. If x and y are sparse, computing \(x + y\) requires no more than 
min{nnz(x), nnz(y)} flops (since no arithmetic operations are required to compute 
\((x + y)_i\) when either \(x_i\) or \(y_i\) is zero). If the sparsity patterns of x and y do not 
overlap (intersect), then zero flops are needed to compute \(x + y\). The inner product 
calculation is similar: computing \(x^T y\) requires no more than 2 min{nnz(x), nnz(y)} 
flops. When the sparsity patterns of x and y do not overlap, computing \(x^T y\) 
requires zero flops, since \(x^T y = 0\) in this case. 
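
To make the flop count for the sparse inner product concrete, here is a minimal Python sketch. It assumes a simple storage scheme of our own choosing, in which a sparse vector is a dictionary mapping the index of each nonzero entry to its value; the function name sparse_inner and the example vectors are hypothetical. The function counts one multiplication and one addition for each index where both vectors are nonzero.

```python
def sparse_inner(x, y):
    """Inner product of sparse vectors stored as {index: nonzero value} dicts.

    Returns the value together with the number of flops counted.
    """
    if len(x) > len(y):
        x, y = y, x  # loop over the vector with fewer nonzero entries
    total = 0
    flops = 0
    for i, xi in x.items():
        if i in y:  # only indices where both entries are nonzero contribute
            total += xi * y[i]
            flops += 2
    return total, flops


x = {0: 3, 5: -1, 9: 2}  # nnz(x) = 3
y = {5: 4, 7: 1}         # nnz(y) = 2
print(sparse_inner(x, y))  # (-4, 2)

z = {1: 6, 2: 8}         # sparsity pattern does not overlap with that of x
print(sparse_inner(x, z))  # (0, 0)
```

In the first case the count of 2 flops is within the bound 2 min{nnz(x), nnz(y)} = 4; in the second case no arithmetic is needed at all.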


1.1 Vector equations. Determine whether each of the equations below is true, false, or contains 
bad notation (and therefore does not make sense). 


(a) \(\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = (1, 2, 1)\). 


(c) (1, (2, 1)) = ((1, 2), 1). 


1.2 Vector notation. Which of the following expressions uses correct notation? When the 
expression does make sense, give its length. In the following, a and b are 10-vectors, and 
c is a 20-vector. 


(a) \(a + b - c_{3:12}\). 
(b) \((a, b, c_{3:13})\). 
(c) \(2a + c\). 
(d) \((a, 1) + (c_1, b)\). 
(e) \(((a, b), a)\). 
(f) \([\, a \;\; b \,] + 4c\). 


(g) \(\begin{bmatrix} a \\ b \end{bmatrix} + 4c\). 


1.3 Overloading. Which of the following expressions uses correct notation? If the notation is 
correct, is it also unambiguous? Assume that a is a 10-vector and b is a 20-vector. 


(a) \(b = (0, a)\). 
(b) \(a = (0, b)\). 
(c) \(b = (0, a, 0)\). 
(d) \(a = 0 = b\). 


1.4 Periodic energy usage. The 168-vector w gives the hourly electricity consumption of 
a manufacturing plant, starting on Sunday midnight to 1AM, over one week, in MWh 
(megawatt-hours). The consumption pattern is the same each day, i.e., it is 24-periodic, 
which means that \(w_{t+24} = w_t\) for \(t = 1, \ldots, 144\). Let d be the 24-vector that gives the 
energy consumption over one day, starting at midnight. 


(a) Use vector notation to express w in terms of d. 


(b) Use vector notation to express d in terms of w. 


1.5 Interpreting sparsity. Suppose the n-vector x is sparse, i.e., has only a few nonzero entries. 
Give a short sentence or two explaining what this means in each of the following contexts. 


(a) x represents the daily cash flow of some business over n days. 


(b) x represents the annual dollar value purchases by a customer of n products or ser- 
vices. 


(c) x represents a portfolio, say, the dollar value holdings of n stocks. 


(d) x represents a bill of materials for a project, i.e., the amounts of n materials needed. 


(e) x represents a monochrome image, i.e., the brightness values of n pixels. 


f) x is the daily rainfall in a location over one year. 
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1.6 


1.7 


1.8 


1.9 


1.10 


1.11 


1.12 


Vector of differences. Suppose x is an n-vector. The associated vector of differences is the 
(n — 1)-vector d given by d = (x2 — £1, £3 — %2,...,%n — Zn—1). Express d in terms of x 
using vector operations (e.g., slicing notation, sum, difference, linear combinations, inner 
product). The difference vector has a simple interpretation when x represents a time 
series. For example, if x gives the daily value of some quantity, d gives the day-to-day 
changes in the quantity. 


Transforming between two encodings for Boolean vectors. A Boolean n-vector is one 
for which all entries are either 0 or 1. Such vectors are used to encode whether each 
of n conditions holds, with a; = 1 meaning that condition 7 holds. Another common 
encoding of the same information uses the two values —1 and +1 for the entries. For 
example the Boolean vector (0,1,1,0) would be written using this alternative encoding 
as (—1, +1, +1,—1). Suppose that x is a Boolean vector with entries that are 0 or 1, and 
y is a vector encoding the same information using the values —1 and +1. Express y in 
terms of x using vector notation. Also, express x in terms of y using vector notation. 


1.8 Profit and sales vectors. A company sells n different products or items. The n-vector p
gives the profit, in dollars per unit, for each of the n items. (The entries of p are typically
positive, but a few items might have negative entries. These items are called loss leaders,
and are used to increase customer engagement in the hope that the customer will make
other, profitable purchases.) The n-vector s gives the total sales of each of the items, over
some period (such as a month), i.e., $s_i$ is the total number of units of item i sold. (These
are also typically nonnegative, but negative entries can be used to reflect items that were 
purchased in a previous time period and returned in this one.) Express the total profit in 
terms of p and s using vector notation. 


1.9 Symptoms vector. A 20-vector s records whether each of 20 different symptoms is present
in a medical patient, with $s_i = 1$ meaning the patient has the symptom and $s_i = 0$
meaning she does not. Express the following using vector notation.


(a) The total number of symptoms the patient has. 
(b) The patient exhibits five out of the first ten symptoms. 


1.10 Total score from course record. The record for each student in a class is given as a
10-vector r, where $r_1, \ldots, r_8$ are the grades for the 8 homework assignments, each on a 0-10
scale, $r_9$ is the midterm exam grade on a 0-120 scale, and $r_{10}$ is the final exam score on a
0-160 scale. The student's total course score s, on a 0-100 scale, is based 25% on the
homework, 35% on the midterm exam, and 40% on the final exam. Express s in the form
$s = w^T r$. (That is, determine the 10-vector w.) You can give the coefficients of w to 4
digits after the decimal point.


1.11 Word count and word count histogram vectors. Suppose the n-vector w is the word count
vector associated with a document and a dictionary of n words. For simplicity we will
assume that all words in the document appear in the dictionary.


(a) What is $\mathbf{1}^T w$?
(b) What does $w_{282} = 0$ mean?


(c) Let h be the n-vector that gives the histogram of the word counts, i.e., $h_i$ is the
fraction of the words in the document that are word i. Use vector notation to express
h in terms of w. (You can assume that the document contains at least one word.)


1.12 Total cash value. An international company holds cash in five currencies: USD (US
dollar), RMB (Chinese yuan), EUR (euro), GBP (British pound), and JPY (Japanese
yen), in amounts given by the 5-vector c. For example, $c_2$ gives the number of RMB held.
Negative entries in c represent liabilities or amounts owed. Express the total (net) value 
of the cash in USD, using vector notation. Be sure to give the size and define the entries 
of any vectors that you introduce in your solution. Your solution can refer to currency 
exchange rates. 


1.13 Average age in a population. Suppose the 100-vector x represents the distribution of ages
in some population of people, with $x_i$ being the number of $i-1$ year olds, for $i = 1, \ldots, 100$.
(You can assume that $x \neq 0$, and that there is no one in the population over age 99.)
Find expressions, using vector notation, for the following quantities. 


(a) The total number of people in the population. 
(b) The total number of people in the population age 65 and over. 


(c) The average age of the population. (You can use ordinary division of numbers in 
your expression.) 


1.14 Industry or sector exposure. Consider a set of n assets or stocks that we invest in. Let
f be an n-vector that encodes whether each asset is in some specific industry or sector,
e.g., pharmaceuticals or consumer electronics. Specifically, we take $f_i = 1$ if asset i is in
the sector, and $f_i = 0$ if it is not. Let the n-vector h denote a portfolio, with $h_i$ the dollar
value held in asset i (with negative meaning a short position). The inner product $f^T h$
is called the (dollar value) exposure of our portfolio to the sector. It gives the net dollar
value of the portfolio that is invested in assets from the sector. A portfolio h is called
neutral (to a sector or industry) if $f^T h = 0$.

A portfolio h is called long only if each entry is nonnegative, i.e., $h_i \geq 0$ for each i. This
means the portfolio does not include any short positions.

What does it mean if a long-only portfolio is neutral to a sector, say, pharmaceuticals? 
Your answer should be in simple English, but you should back up your conclusion with 
an argument. 


1.15 Cheapest supplier. You must buy n raw materials in quantities given by the n-vector q,
where $q_i$ is the amount of raw material i that you must buy. A set of K potential suppliers
offer the raw materials at prices given by the n-vectors $p_1, \ldots, p_K$. (Note that $p_k$ is an
n-vector; $(p_k)_i$ is the price that supplier k charges per unit of raw material i.) We will
assume that all quantities and prices are positive.

If you must choose just one supplier, how would you do it? Your answer should use vector 
notation. 

A (highly paid) consultant tells you that you might do better (i.e., get a better total cost) 
by splitting your order into two, by choosing two suppliers and ordering (1/2)q (i.e., half 
the quantities) from each of the two. He argues that having a diversity of suppliers is 
better. Is he right? If so, explain how to find the two suppliers you would use to fill half 
the order. 


1.16 Inner product of nonnegative vectors. A vector is called nonnegative if all its entries are
nonnegative. 


(a) Explain why the inner product of two nonnegative vectors is nonnegative. 


(b) Suppose the inner product of two nonnegative vectors is zero. What can you say 
about them? Your answer should be in terms of their respective sparsity patterns, 
i.e., which entries are zero and nonzero. 


1.17 Linear combinations of cash flows. We consider cash flow vectors over T time periods,
with a positive entry meaning a payment received, and negative meaning a payment
made. A (unit) single period loan, at time period t, is the T-vector $l_t$ that corresponds
to a payment received of \$1 in period t and a payment made of \$$(1 + r)$ in period $t + 1$,
with all other payments zero. Here $r > 0$ is the interest rate (over one period).

Let c be a \$1 $T - 1$ period loan, starting at period 1. This means that \$1 is received in
period 1, \$$(1 + r)^{T-1}$ is paid in period T, and all other payments (i.e., $c_2, \ldots, c_{T-1}$) are
zero. Express c as a linear combination of single period loans.


1.18 Linear combinations of linear combinations. Suppose that each of the vectors $b_1, \ldots, b_k$ is
a linear combination of the vectors $a_1, \ldots, a_m$, and c is a linear combination of $b_1, \ldots, b_k$.
Then c is a linear combination of $a_1, \ldots, a_m$. Show this for the case with $m = k = 2$.
(Showing it in general is not much more difficult, but the notation gets more complicated.)


1.19 Auto-regressive model. Suppose that $z_1, z_2, \ldots$ is a time series, with the number $z_t$ giving
the value in period or time t. For example $z_t$ could be the gross sales at a particular store
on day t. An auto-regressive (AR) model is used to predict $z_{t+1}$ from the previous M
values, $z_t, z_{t-1}, \ldots, z_{t-M+1}$:

$$\hat z_{t+1} = (z_t, z_{t-1}, \ldots, z_{t-M+1})^T \beta, \quad t = M, M+1, \ldots.$$

Here $\hat z_{t+1}$ denotes the AR model's prediction of $z_{t+1}$, M is the memory length of the
AR model, and the M-vector $\beta$ is the AR model coefficient vector. For this problem we
will assume that the time period is daily, and $M = 10$. Thus, the AR model predicts
tomorrow's value, given the values over the last 10 days.


For each of the following cases, give a short interpretation or description of the AR model 
in English, without referring to mathematical concepts like vectors, inner product, and 
so on. You can use words like ‘yesterday’ or ‘today’. 


(a) $\beta \approx e_1$.

(b) $\beta \approx 2e_1 - e_2$.

(c) $\beta \approx e_6$.

(d) $\beta \approx 0.5e_1 + 0.5e_2$.


1.20 How many bytes does it take to store 100 vectors of length $10^5$? How many flops does
it take to form a linear combination of them (with 100 nonzero coefficients)? About how 
long would this take on a computer capable of carrying out 1 Gflop/s? 


Chapter 2


Linear functions 


In this chapter we introduce linear and affine functions, and describe some common 
settings where they arise, including regression models. 


2.1 Linear functions


Function notation. The notation $f : \mathbf{R}^n \to \mathbf{R}$ means that f is a function that
maps real n-vectors to real numbers, i.e., it is a scalar-valued function of n-vectors.
If x is an n-vector, then f(x), which is a scalar, denotes the value of the function f
at x. (In the notation f(x), x is referred to as the argument of the function.) We
can also interpret f as a function of n scalar arguments, the entries of the vector
argument, in which case we write f(x) as

$$f(x) = f(x_1, x_2, \ldots, x_n).$$

Here we refer to $x_1, \ldots, x_n$ as the arguments of f. We sometimes say that f is
real-valued, or scalar-valued, to emphasize that f(x) is a real number or scalar.

To describe a function $f : \mathbf{R}^n \to \mathbf{R}$, we have to specify what its value is for any
possible argument $x \in \mathbf{R}^n$. For example, we can define a function $f : \mathbf{R}^4 \to \mathbf{R}$ by

$$f(x) = x_1 + x_2 - x_4^2$$


for any 4-vector x. In words, we might describe f as the sum of the first two 
elements of its argument, minus the square of the last entry of the argument. 
(This particular function does not depend on the third element of its argument.) 
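
As a small illustration, the function described above can be written as a Python function. This is only a sketch: the name `f` and the use of a plain list to stand in for a 4-vector are assumptions made for the example.

```python
def f(x):
    """Sum of the first two entries of a 4-vector, minus the square of the last entry."""
    if len(x) != 4:
        raise ValueError("f is defined for 4-vectors only")
    x1, x2, _x3, x4 = x          # the third entry is deliberately unused
    return x1 + x2 - x4 ** 2

# Changing only the third entry leaves the value unchanged.
value_a = f([1.0, 2.0, 0.0, 3.0])
value_b = f([1.0, 2.0, 99.0, 3.0])
print(value_a, value_b)
```

Both calls return the same number, which reflects the remark that this function does not depend on the third element of its argument.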

Sometimes we introduce a function without formally assigning a symbol for it, 
by directly giving a formula for its value in terms of its arguments, or describing 
how to find its value from its arguments. An example is the sum function, whose 
value is $x_1 + \cdots + x_n$. We can give a name to the value of the function, as in
$y = x_1 + \cdots + x_n$, and say that y is a function of x, in this case, the sum of its
entries.

Many functions are not given by formulas or equations. As an example, suppose
$f : \mathbf{R}^3 \to \mathbf{R}$ is the function that gives the lift (vertical upward force) on a particular
airplane, as a function of the 3-vector x, where $x_1$ is the angle of attack of the
airplane (i.e., the angle between the airplane body and its direction of motion), $x_2$
is its air speed, and $x_3$ is the air density.


The inner product function. Suppose a is an n-vector. We can define a scalar- 
valued function f of n-vectors, given by 


$$f(x) = a^T x = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \qquad (2.1)$$


for any n-vector x. This function gives the inner product of its n-vector argument x 
with some (fixed) n-vector a. We can also think of f as forming a weighted sum of 
the elements of x; the elements of a give the weights used in forming the weighted 
sum. 


Superposition and linearity. The inner product function f defined in (2.1) satisfies 
the property 


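$$\begin{aligned}
f(\alpha x + \beta y) &= a^T(\alpha x + \beta y) \\
&= a^T(\alpha x) + a^T(\beta y) \\
&= \alpha (a^T x) + \beta (a^T y) \\
&= \alpha f(x) + \beta f(y)
\end{aligned}$$


for all n-vectors x, y, and all scalars $\alpha$, $\beta$. This property is called superposition.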
A function that satisfies the superposition property is called linear. We have just 
shown that the inner product with a fixed vector is a linear function. 
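
The superposition property can also be checked numerically. The sketch below uses plain Python lists as vectors; the helper names `inner` and `lincomb` and all numerical values are illustrative assumptions, not part of the text.

```python
def inner(a, x):
    """Inner product a^T x of two equal-length lists."""
    if len(a) != len(x):
        raise ValueError("vectors must have the same length")
    return sum(ai * xi for ai, xi in zip(a, x))

def lincomb(alpha, x, beta, y):
    """Entrywise linear combination alpha*x + beta*y."""
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

a = [2.0, -1.0, 0.5]              # fixed weight vector (illustrative values)

def f(x):
    return inner(a, x)            # f(x) = a^T x

x, y = [1.0, 4.0, -2.0], [0.0, 3.0, 6.0]
alpha, beta = 3.0, -2.0

lhs = f(lincomb(alpha, x, beta, y))   # vector operations first, then f
rhs = alpha * f(x) + beta * f(y)      # f first, then scalar operations
print(lhs, rhs)
```

The two printed numbers agree. A single numerical check is of course not a proof; the derivation above is what establishes the property for all vectors and scalars.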

The superposition equality 


$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y) \qquad (2.2)$$


looks deceptively simple; it is easy to read it as just a re-arrangement of the parentheses
and the order of a few terms. But in fact it says a lot. On the left-hand
side, the term $\alpha x + \beta y$ involves scalar-vector multiplication and vector addition.
On the right-hand side, $\alpha f(x) + \beta f(y)$ involves ordinary scalar multiplication and
scalar addition.

If a function f is linear, superposition extends to linear combinations of any 
number of vectors, and not just linear combinations of two vectors: We have 


$$f(\alpha_1 x_1 + \cdots + \alpha_k x_k) = \alpha_1 f(x_1) + \cdots + \alpha_k f(x_k),$$


for any n-vectors $x_1, \ldots, x_k$, and any scalars $\alpha_1, \ldots, \alpha_k$. (This more general k-term
form of superposition reduces to the two-term form given above when $k = 2$.) To
see this, we note that


$$\begin{aligned}
f(\alpha_1 x_1 + \cdots + \alpha_k x_k) &= \alpha_1 f(x_1) + f(\alpha_2 x_2 + \cdots + \alpha_k x_k) \\
&= \alpha_1 f(x_1) + \alpha_2 f(x_2) + f(\alpha_3 x_3 + \cdots + \alpha_k x_k) \\
&\;\;\vdots \\
&= \alpha_1 f(x_1) + \cdots + \alpha_k f(x_k).
\end{aligned}$$

In the first line here, we apply (two-term) superposition to the argument

$$\alpha_1 x_1 + (1)(\alpha_2 x_2 + \cdots + \alpha_k x_k),$$

and in the other lines we apply this recursively.

The superposition equality (2.2) is sometimes broken down into two proper- 
ties, one involving the scalar-vector product and one involving vector addition in 
the argument. A function $f : \mathbf{R}^n \to \mathbf{R}$ is linear if it satisfies the following two
properties.


• Homogeneity. For any n-vector x and any scalar $\alpha$, $f(\alpha x) = \alpha f(x)$.
• Additivity. For any n-vectors x and y, $f(x + y) = f(x) + f(y)$.


Homogeneity states that scaling the (vector) argument is the same as scaling the 
function value; additivity says that adding (vector) arguments is the same as adding 
the function values. 


Inner product representation of a linear function. We saw above that a function 
defined as the inner product of its argument with some fixed vector is linear. The 
converse is also true: If a function is linear, then it can be expressed as the inner 
product of its argument with some fixed vector. 

Suppose f is a scalar-valued function of n-vectors, and is linear, i.e., (2.2) holds 
for all n-vectors x, y, and all scalars $\alpha$, $\beta$. Then there is an n-vector a such that
$f(x) = a^T x$ for all x. We call $a^T x$ the inner product representation of f.

To see this, we use the identity (1.1) to express an arbitrary n-vector x as
$x = x_1 e_1 + \cdots + x_n e_n$. If f is linear, then by multi-term superposition we have


$$\begin{aligned}
f(x) &= f(x_1 e_1 + \cdots + x_n e_n) \\
&= x_1 f(e_1) + \cdots + x_n f(e_n) \\
&= a^T x,
\end{aligned}$$


with $a = (f(e_1), f(e_2), \ldots, f(e_n))$. The formula just derived,


$$f(x) = x_1 f(e_1) + x_2 f(e_2) + \cdots + x_n f(e_n) \qquad (2.3)$$


which holds for any linear scalar-valued function f, has several interesting impli- 
cations. Suppose, for example, that the linear function f is given as a subroutine 
(or a physical system) that computes (or results in the output) f(x) when we give 
the argument (or input) x. Once we have found $f(e_1), \ldots, f(e_n)$, by n calls to the
subroutine (or n experiments), we can predict (or simulate) what f(x) will be, for 
any vector x, using the formula (2.3). 
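
The procedure just described can be sketched in a few lines of Python. The function `black_box` below is a hypothetical stand-in for the subroutine or physical system; the helper names and numerical values are assumptions made for illustration.

```python
def unit_vector(i, n):
    """The standard unit vector of length n with a 1 in position i (0-based here)."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def inner_product_representation(f, n):
    """Return a = (f(e_1), ..., f(e_n)), using exactly n calls to f."""
    return [f(unit_vector(i, n)) for i in range(n)]

# A hypothetical linear black box; we pretend its formula is hidden from us.
def black_box(x):
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[2]

a = inner_product_representation(black_box, 3)

x = [2.0, 1.0, 4.0]
predicted = sum(ai * xi for ai, xi in zip(a, x))   # formula (2.3)
print(a, predicted, black_box(x))
```

After the three calls that determine `a`, the prediction for any other argument needs no further calls to `black_box`. The prediction is only valid if the black box really is linear.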

The representation of a linear function f as $f(x) = a^T x$ is unique, which means
that there is only one vector a for which $f(x) = a^T x$ holds for all x. To see this,
suppose that we have $f(x) = a^T x$ for all x, and also $f(x) = b^T x$ for all x. Taking
$x = e_i$, we have $f(e_i) = a^T e_i = a_i$, using the formula $f(x) = a^T x$. Using the
formula $f(x) = b^T x$, we have $f(e_i) = b^T e_i = b_i$. These two numbers must be the
same, so we have $a_i = b_i$. Repeating this argument for $i = 1, \ldots, n$, we conclude
that the corresponding elements in a and b are the same, so $a = b$.

Examples.


• Average. The mean or average value of an n-vector is defined as

$$f(x) = (x_1 + x_2 + \cdots + x_n)/n,$$

and is denoted $\mathbf{avg}(x)$ (and sometimes $\bar{x}$). The average of a vector is a linear
function. It can be expressed as $\mathbf{avg}(x) = a^T x$ with

$$a = (1/n, \ldots, 1/n) = \mathbf{1}/n.$$


• Maximum. The maximum element of an n-vector x, $f(x) = \max\{x_1, \ldots, x_n\}$,
is not a linear function (except when $n = 1$). We can show this by a counterexample
for $n = 2$. Take $x = (1, -1)$, $y = (-1, 1)$, $\alpha = 1/2$, $\beta = 1/2$.
Then

$$f(\alpha x + \beta y) = 0 \neq \alpha f(x) + \beta f(y) = 1.$$
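
Both examples can be reproduced with a short Python sketch. It uses the vectors and coefficients of the counterexample above; the function name `avg` and the variable names are assumptions made for illustration.

```python
def avg(x):
    """Average of the entries, written as the inner product of x with (1/n, ..., 1/n)."""
    n = len(x)
    a = [1.0 / n] * n
    return sum(ai * xi for ai, xi in zip(a, x))

x, y = [1.0, -1.0], [-1.0, 1.0]
alpha = beta = 0.5
combo = [alpha * xi + beta * yi for xi, yi in zip(x, y)]   # equals (0, 0)

# The average satisfies superposition on this example ...
avg_lhs = avg(combo)
avg_rhs = alpha * avg(x) + beta * avg(y)

# ... while the maximum does not: the two sides differ.
max_lhs = max(combo)
max_rhs = alpha * max(x) + beta * max(y)
print(avg_lhs, avg_rhs, max_lhs, max_rhs)
```

One failing instance is enough to show that the maximum is not linear, whereas the agreement for the average merely illustrates a property that holds in general.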


Affine functions. A linear function plus a constant is called an affine function. A
function $f : \mathbf{R}^n \to \mathbf{R}$ is affine if and only if it can be expressed as $f(x) = a^T x + b$
for some n-vector a and scalar b, which is sometimes called the offset. For example,
the function on 3-vectors defined by


$$f(x) = 2.3 - 2x_1 + 1.3x_2 - x_3,$$


is affine, with $b = 2.3$, $a = (-2, 1.3, -1)$.

Any affine scalar-valued function satisfies the following variation on the superposition
property:


$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y),$$


for all n-vectors x, y, and all scalars $\alpha$, $\beta$ that satisfy $\alpha + \beta = 1$. For linear functions,
superposition holds for any coefficients $\alpha$ and $\beta$; for affine functions, it holds when
the coefficients sum to one (i.e., when the argument is an affine combination).

To see that the restricted superposition property holds for an affine function
$f(x) = a^T x + b$, we note that, for any vectors x, y and scalars $\alpha$ and $\beta$ that satisfy
$\alpha + \beta = 1$,


$$\begin{aligned}
f(\alpha x + \beta y) &= a^T(\alpha x + \beta y) + b \\
&= \alpha a^T x + \beta a^T y + (\alpha + \beta) b \\
&= \alpha (a^T x + b) + \beta (a^T y + b) \\
&= \alpha f(x) + \beta f(y).
\end{aligned}$$


(In the second line we use $\alpha + \beta = 1$.)
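
The role of the condition that the coefficients sum to one can be seen numerically. The sketch below uses the affine example given earlier, with a = (-2, 1.3, -1) and b = 2.3; the helper name `gap` and the test points are assumptions chosen for illustration.

```python
a, b = [-2.0, 1.3, -1.0], 2.3          # the affine example f(x) = a^T x + b

def f(x):
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def gap(alpha, beta, x, y):
    """f(alpha*x + beta*y) - (alpha*f(x) + beta*f(y)); zero when superposition holds."""
    combo = [alpha * xi + beta * yi for xi, yi in zip(x, y)]
    return f(combo) - (alpha * f(x) + beta * f(y))

x, y = [1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]   # arbitrary test points

gap_affine = gap(0.3, 0.7, x, y)   # coefficients sum to one
gap_other = gap(1.0, 1.0, x, y)    # coefficients sum to two
print(gap_affine, gap_other)
```

The first gap is zero up to rounding error. The second equals $(1 - (\alpha + \beta))\,b = -2.3$, which is the term that the condition $\alpha + \beta = 1$ removes in the derivation above.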

This restricted superposition property for affine functions is useful in showing 
that a function f is not affine: We find vectors x, y, and numbers $\alpha$ and $\beta$ with
$\alpha + \beta = 1$, and verify that $f(\alpha x + \beta y) \neq \alpha f(x) + \beta f(y)$. This shows that f cannot
be affine. As an example, we verified above that superposition does not hold for
the maximum function (with $n > 1$); the coefficients in our counterexample are
$\alpha = \beta = 1/2$, which sum to one, which allows us to conclude that the maximum
function is not affine. 

Figure 2.1 Left. The function f is linear. Right. The function g is affine,
but not linear.


The converse is also true: Any scalar-valued function that satisfies the restricted 
superposition property is affine. An analog of the formula (2.3) is 


$$f(x) = f(0) + x_1 (f(e_1) - f(0)) + \cdots + x_n (f(e_n) - f(0)), \qquad (2.4)$$


which holds when f is affine, and x is any n-vector. (See exercise 2.7.) This formula
shows that for an affine function, once we know the $n + 1$ numbers $f(0), f(e_1), \ldots,$
$f(e_n)$, we can predict (or reconstruct or evaluate) f(x) for any n-vector x. It also
shows how the vector a and constant b in the representation $f(x) = a^T x + b$ can
be found from the function f: $a_i = f(e_i) - f(0)$, and $b = f(0)$.
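
A minimal sketch of this reconstruction in Python follows. The function `black_box` is a hypothetical affine function used only to exercise the code; the helper name `affine_representation` is likewise an assumption.

```python
def affine_representation(f, n):
    """Return (a, b) with f(x) = a^T x + b, from the n + 1 values f(0), f(e_1), ..., f(e_n)."""
    b = f([0.0] * n)                                   # b = f(0)
    a = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        a.append(f(e_i) - b)                           # a_i = f(e_i) - f(0)
    return a, b

# Hypothetical affine black box, chosen only for illustration.
def black_box(x):
    return 4.0 + 2.0 * x[0] - 3.0 * x[1]

a, b = affine_representation(black_box, 2)
x = [5.0, 1.5]
reconstructed = b + sum(ai * xi for ai, xi in zip(a, x))   # formula (2.4)
print(a, b, reconstructed, black_box(x))
```

Compared with the linear case, one extra evaluation at the zero vector is needed in order to find the offset.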

In some contexts affine functions are called linear. For example, when x is a
scalar, the function f defined as $f(x) = \alpha x + \beta$ is sometimes referred to as a linear
function of x, perhaps because its graph is a line. But when $\beta \neq 0$, f is not a linear
function of x, in the standard mathematical sense; it is an affine function of x.
In this book we will distinguish between linear and affine functions. Two simple 
examples are shown in figure 2.1. 


A civil engineering example. Many scalar-valued functions that arise in science 
and engineering are well approximated by linear or affine functions. As a typical 
example, consider a steel structure like a bridge, and let w be an n-vector that 
gives the weight of the load on the bridge in n specific locations, in metric tons. 
These loads will cause the bridge to deform (move and change shape) slightly. 
Let s denote the distance that a specific point on the bridge sags, in millimeters, 
due to the load w. This is shown in figure 2.2. For weights the bridge is designed 
to handle, the sag is very well approximated as a linear function $s = f(w)$. This
function can be expressed as an inner product, $s = c^T w$, for some n-vector c. From
the equation $s = c_1 w_1 + \cdots + c_n w_n$, we see that $c_1 w_1$ is the amount of the sag that
is due to the weight $w_1$, and similarly for the other weights. The coefficients $c_i$,
which have units of mm/ton, are called compliances, and give the sensitivity of the 
sag with respect to loads applied at the n locations. 

Figure 2.2 A bridge with weights $w_1$, $w_2$, $w_3$ applied in 3 locations. These
weights cause the bridge to sag in the middle, by an amount s. (The sag is
exaggerated in this diagram.)


The vector c can be computed by (numerically) solving a partial differential
equation, given the detailed design of the bridge and the mechanical properties of
the steel used to construct it. This is always done during the design of a bridge.
The vector c can also be measured once the bridge is built, using the formula (2.3).
We apply the load $w = e_1$, which means that we place a one ton load at the first
load position on the bridge, with no load at the other positions. We can then
measure the sag, which is $c_1$. We repeat this experiment, moving the one ton load
to positions $2, 3, \ldots, n$, which gives us the coefficients $c_2, \ldots, c_n$. At this point
we have the vector c, so we can now predict what the sag will be with any other
loading. To check our measurements (and linearity of the sag function) we might
measure the sag under other more complicated loadings, and in each case compare
our prediction (i.e., $c^T w$) with the actual measured sag.


    w_1    w_2    w_3    Measured sag    Predicted sag

    1      0      0      0.12            -
    0      1      0      0.31            -
    0      0      1      0.26            -
    0.5    1.1    0.3    0.481           0.479
    1.5    0.8    1.2    0.736           0.740


Table 2.1 Loadings on a bridge (first three columns), the associated measured
sag at a certain point (fourth column), and the predicted sag using
the linear model constructed from the first three experiments (fifth column).

Table 2.1 shows what the results of these experiments might look like, with each 
row representing an experiment (i.e., placing the loads and measuring the sag). In 
the last two rows we compare the measured sag and the predicted sag, using the 
linear function with coefficients found in the first three experiments. 
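
The predicted sags in Table 2.1 can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the second load in the fourth experiment is taken to be 1.1 tons, which is the value consistent with the tabulated prediction of 0.479. The function name `predicted_sag` is also an assumption.

```python
# Sags measured with a one-ton load at each position give c directly (formula (2.3)).
c = [0.12, 0.31, 0.26]                     # compliances in mm/ton, from Table 2.1

def predicted_sag(w):
    """Linear model s = c^T w for a load vector w in metric tons."""
    return sum(ci * wi for ci, wi in zip(c, w))

# The two validation loadings with their measured sags (mm) from Table 2.1.
checks = [([0.5, 1.1, 0.3], 0.481), ([1.5, 0.8, 1.2], 0.736)]
for w, measured in checks:
    s = predicted_sag(w)
    print(f"w = {w}: predicted {s:.3f} mm, measured {measured:.3f} mm")
```

The predictions differ from the measurements by a few thousandths of a millimeter, which is the kind of agreement that supports using the linear model for loads in this range.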


2.2 Taylor approximation


In many applications, scalar-valued functions of n variables, or relations between 
n variables and a scalar one, can be approximated as linear or affine functions. In 
these cases we sometimes refer to the linear or affine function relating the vari- 
ables and the scalar variable as a model, to remind us that the relation is only an 
approximation, and not exact. 

Differential calculus gives us an organized way to find an approximate affine 
model. Suppose that $f : \mathbf{R}^n \to \mathbf{R}$ is differentiable, which means that its partial
derivatives exist (see §C.1). Let z be an n-vector. The (first-order) Taylor
approximation of f near (or at) the point z is the function $\hat f(x)$ of x defined as


$$\hat f(x) = f(z) + \frac{\partial f}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial f}{\partial x_n}(z)(x_n - z_n),$$


where $\frac{\partial f}{\partial x_i}(z)$ denotes the partial derivative of f with respect to its ith argument,
evaluated at the n-vector z. The hat appearing over f on the left-hand side is
a common notational hint that it is an approximation of the function f. (The
approximation is named after the mathematician Brook Taylor.)

The first-order Taylor approximation $\hat f(x)$ is a very good approximation of f(x)
when all $x_i$ are near the associated $z_i$. Sometimes $\hat f$ is written with a second vector
argument, as $\hat f(x; z)$, to show the point z at which the approximation is developed.
The first term in the Taylor approximation is a constant; the other terms can be 
interpreted as the contributions to the (approximate) change in the function value 
(from f(z)) due to the changes in the components of x (from z). 

Evidently $\hat f$ is an affine function of x. (It is sometimes called the linear approximation
of f near z, even though it is in general affine, and not linear.) It can be
written compactly using inner product notation as


$$\hat f(x) = f(z) + \nabla f(z)^T (x - z), \qquad (2.5)$$

where $\nabla f(z)$ is an n-vector, the gradient of f (at the point z),

$$\nabla f(z) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}(z) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(z) \end{bmatrix}. \qquad (2.6)$$


The first term in the Taylor approximation (2.5) is the constant f(z), the value of
the function when $x = z$. The second term is the inner product of the gradient of
f at z and the deviation or perturbation of x from z, i.e., $x - z$.

We can express the first-order Taylor approximation as a linear function plus a
constant,

$$\hat f(x) = \nabla f(z)^T x + (f(z) - \nabla f(z)^T z),$$

but the form (2.5) is perhaps easier to interpret.

The first-order Taylor approximation gives us an organized way to construct 
an affine approximation of a function $f : \mathbf{R}^n \to \mathbf{R}$, near a given point z, when
there is a formula or equation that describes f, and it is differentiable. A simple
example, for $n = 1$, is shown in figure 2.3. Over the full x-axis scale shown, the
Taylor approximation $\hat f$ does not give a good approximation of the function f. But
for x near z, the Taylor approximation is very good.


Example. Consider the function $f : \mathbf{R}^2 \to \mathbf{R}$ given by $f(x) = x_1 + \exp(x_2 - x_1)$,
which is not linear or affine. To find the Taylor approximation $\hat f$ near the point
$z = (1, 2)$, we take partial derivatives to obtain

$$\nabla f(z) = \begin{bmatrix} 1 - \exp(z_2 - z_1) \\ \exp(z_2 - z_1) \end{bmatrix},$$

which evaluates to $(-1.7183, 2.7183)$ at $z = (1, 2)$. The Taylor approximation at
$z = (1, 2)$ is then

$$\begin{aligned}
\hat f(x) &= 3.7183 + (-1.7183, 2.7183)^T (x - (1, 2)) \\
&= 3.7183 - 1.7183(x_1 - 1) + 2.7183(x_2 - 2).
\end{aligned}$$

Table 2.2 shows f(x) and $\hat f(x)$, and the approximation error $|\hat f(x) - f(x)|$, for some
values of x relatively near z. We can see that $\hat f$ is indeed a very good approximation
of f, especially when x is near z.

Figure 2.3 A function f of one variable, and the first-order Taylor approximation
$\hat f(x) = f(z) + f'(z)(x - z)$ at z.


    x               f(x)      $\hat f(x)$    $|\hat f(x) - f(x)|$

    (1.00, 2.00)    3.7183    3.7183         0.0000
    (0.96, 1.98)    3.7332    3.7326         0.0005
    (1.10, 2.11)    3.8456    3.8455         0.0001
    (0.85, 2.05)    4.1701    4.1119         0.0582
    (1.25, 2.41)    4.4399    4.4032         0.0367


Table 2.2 Some values of x (first column), the function value f(x) (second
column), the Taylor approximation $\hat f(x)$ (third column), and the error
(fourth column).
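
The example can be reproduced numerically with the sketch below, which uses only Python's standard `math` module. The function names `grad_f` and `taylor` are assumptions, and the sample points are chosen so that the computed values agree with those listed in Table 2.2.

```python
import math

def f(x):
    """f(x) = x1 + exp(x2 - x1)."""
    return x[0] + math.exp(x[1] - x[0])

def grad_f(z):
    """Gradient of f at z, obtained from the two partial derivatives."""
    e = math.exp(z[1] - z[0])
    return [1.0 - e, e]

def taylor(x, z):
    """First-order Taylor approximation f(z) + grad f(z)^T (x - z)."""
    g = grad_f(z)
    return f(z) + sum(gi * (xi - zi) for gi, xi, zi in zip(g, x, z))

z = [1.0, 2.0]
# Sample points near z, chosen to be consistent with the values in Table 2.2.
points = [[1.00, 2.00], [0.96, 1.98], [1.10, 2.11], [0.85, 2.05], [1.25, 2.41]]
for x in points:
    fx, fhat = f(x), taylor(x, z)
    print(f"x = {x}  f = {fx:.4f}  fhat = {fhat:.4f}  error = {abs(fhat - fx):.4f}")
```

The error grows as x moves away from z, which matches the observation that the approximation is best near the point where it is developed.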


2.3 Regression model


In this section we describe a very commonly used affine function, especially when 
the n-vector x represents a feature vector. The affine function of x given by 


$$\hat y = x^T \beta + v, \qquad (2.7)$$


where $\beta$ is an n-vector and v is a scalar, is called a regression model. In this context,
the entries of x are called the regressors, and $\hat y$ is called the prediction, since the
regression model is typically an approximation or prediction of some true value y,
which is called the dependent variable, outcome, or label.

The vector $\beta$ is called the weight vector or coefficient vector, and the scalar
v is called the offset or intercept in the regression model. Together, $\beta$ and v are
called the parameters in the regression model. (We will see in chapter 13 how the
parameters in a regression model can be estimated or guessed, based on some past
or known observations of the feature vector x and the associated outcome y.) The
symbol $\hat y$ is used in the regression model to emphasize that it is an estimate or
prediction of some outcome y.

The entries in the weight vector have a simple interpretation: $\beta_i$ is the amount
by which $\hat y$ increases (if $\beta_i > 0$) when feature i increases by one (with all other
features the same). If $\beta_i$ is small, the prediction $\hat y$ doesn't depend too strongly on
feature i. The offset v is the value of $\hat y$ when all features have the value 0.

The regression model is very interpretable when all of the features are Boolean, 
i.e., have values that are either 0 or 1, which occurs when the features represent 
which of two outcomes holds. As a simple example consider a regression model 
for the lifespan of a person in some group, with $x_1 = 0$ if the person is female
($x_1 = 1$ if male), $x_2 = 1$ if the person has type II diabetes, and $x_3 = 1$ if the person
smokes cigarettes. In this case, v is the regression model estimate for the lifespan
of a female nondiabetic nonsmoker; $\beta_1$ is the increase in estimated lifespan if the
person is male, $\beta_2$ is the increase in estimated lifespan if the person is diabetic,
and $\beta_3$ is the increase in estimated lifespan if the person smokes cigarettes. (In a
model that fits real data, all three of these coefficients would be negative, meaning 
that they decrease the regression model estimate of lifespan.) 


Simplified regression model notation. Vector stacking can be used to lump the 
weights and offset in the regression model (2.7) into a single parameter vector, 
which simplifies the regression model notation a bit. We create a new regressor 
vector $\tilde x$, with $n + 1$ entries, as $\tilde x = (1, x)$. We can think of $\tilde x$ as a new feature
vector, consisting of all n original features, and one new feature added ($\tilde x_1$) at
the beginning, which always has the value one. We define the parameter vector
$\tilde \beta = (v, \beta)$, so the regression model (2.7) has the simple inner product form


$$\hat y = x^T \beta + v = \begin{bmatrix} 1 \\ x \end{bmatrix}^T \begin{bmatrix} v \\ \beta \end{bmatrix} = \tilde x^T \tilde \beta. \qquad (2.8)$$


Often we omit the tildes, and simply write this as $\hat y = x^T \beta$, where we assume that
the first feature in x is the constant 1. A feature that always has the value 1 is 
not particularly informative or interesting, but it does simplify the notation in a 
regression model. 
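
The equivalence of the two forms can be checked with a short Python sketch. The function names and the parameter values below are hypothetical and serve only to illustrate the stacking.

```python
def inner(a, x):
    """Inner product of two equal-length lists."""
    return sum(ai * xi for ai, xi in zip(a, x))

def predict(x, beta, v):
    """Regression model (2.7): y_hat = x^T beta + v."""
    return inner(x, beta) + v

def predict_stacked(x, beta, v):
    """Same model in the form (2.8): prepend the constant feature 1 and stack (v, beta)."""
    x_tilde = [1.0] + list(x)
    beta_tilde = [v] + list(beta)
    return inner(x_tilde, beta_tilde)

# Hypothetical parameter and feature values, chosen only for illustration.
beta, v = [2.0, -1.0, 0.5], 4.0
x = [1.0, 3.0, 2.0]
y_hat = predict(x, beta, v)
y_hat_stacked = predict_stacked(x, beta, v)
print(y_hat, y_hat_stacked)
```

The stacked form needs a single inner product, which is convenient when the same code handles models with and without an offset.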

    House    $x_1$ (area)    $x_2$ (beds)    y (price)    $\hat y$ (prediction)

    1        0.846           1               115.00       161.37
    2        1.324           2               234.50       213.61
    3        1.150           3               198.00       168.88
    4        3.037           4               528.00       430.67
    5        3.984           5               572.50       552.66


Table 2.3 Five houses with associated feature vectors shown in the second
and third columns. The fourth and fifth columns give the actual price, and
the price predicted by the regression model.


House price regression model. As a simple example of a regression model, sup- 
pose that y is the selling price of a house in some neighborhood, over some time 
period, and the 2-vector x contains attributes of the house: 


• $x_1$ is the house area (in 1000 square feet),
• $x_2$ is the number of bedrooms.


If y represents the selling price of the house, in thousands of dollars, the regression
model

$$\hat y = x^T \beta + v = \beta_1 x_1 + \beta_2 x_2 + v$$


predicts the price in terms of the attributes or features. This regression model is 
not meant to describe an exact relationship between the house attributes and its 
selling price; it is a model or approximation. Indeed, we would expect such a model 
to give, at best, only a crude approximation of selling price. 

As a specific numerical example, consider the regression model parameters 


$$\beta = (148.73, -18.85), \quad v = 54.40. \qquad (2.9)$$


These parameter values were found using the methods we will see in chapter 13, 
based on records of sales for 774 houses in the Sacramento area. Table 2.3 shows 
the feature vectors x for five houses that sold during the period, the actual sale 
price y, and the predicted price from the regression model above. Figure 2.4 
shows the predicted and actual sale prices for 774 houses, including the five houses 
in the table, on a scatter plot, with actual price on the horizontal axis and predicted 
price on the vertical axis. 

We can see that this particular regression model gives reasonable, but not very 
accurate, predictions of the actual sale price. (Regression models for house prices 
that are used in practice use many more than two regressors, and are much more 
accurate.) 

Figure 2.4 Scatter plot of actual and predicted sale prices for 774 houses
sold in Sacramento during a five-day period.


The model parameters in (2.9) are readily interpreted. The parameter $\beta_1 = 148.73$
is the amount the regression model price prediction increases (in thousands
of dollars) when the house area increases by 1000 square feet (with the same number
of bedrooms). The parameter $\beta_2 = -18.85$ is the price prediction increase with
the addition of one bedroom, with the total house area held constant, in units of
thousands of dollars per bedroom. It might seem strange that $\beta_2$ is negative, since
one imagines that adding a bedroom to a house would increase its sale price, not
decrease it. To understand why $\beta_2$ might be negative, we note that it gives the
change in predicted price when we add a bedroom, without adding any additional
area to the house. If we remodel a house by adding a bedroom that also adds more
than around 127 square feet to the house area, the regression model (2.9) does
predict an increase in house sale price. The offset $v = 54.40$ is the predicted price
for a house with no area and no bedrooms, which we might interpret as the model's
prediction of the value of the lot. But this regression model is crude enough that
these interpretations are dubious.
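
The predictions in Table 2.3 and the break-even area of about 127 square feet can be reproduced with the sketch below. It uses the rounded parameters from (2.9), so the computed predictions can differ from the tabulated ones by a few hundredths. The function and variable names are assumptions made for illustration.

```python
beta = [148.73, -18.85]   # weights from (2.9): per 1000 square feet, per bedroom
v = 54.40                 # offset from (2.9), in thousands of dollars

def predict_price(area, beds):
    """Predicted price in thousands of dollars; area is in 1000 square feet."""
    return beta[0] * area + beta[1] * beds + v

# (area, beds, actual price) for the five houses in Table 2.3.
houses = [(0.846, 1, 115.00), (1.324, 2, 234.50), (1.150, 3, 198.00),
          (3.037, 4, 528.00), (3.984, 5, 572.50)]
for area, beds, price in houses:
    print(f"area {area:.3f}, beds {beds}: "
          f"predicted {predict_price(area, beds):.2f}, actual {price:.2f}")

# Extra area (in square feet) at which one added bedroom leaves the prediction unchanged:
# beta_1 * (extra / 1000) + beta_2 = 0.
break_even_sqft = 1000 * (-beta[1] / beta[0])
print(round(break_even_sqft))
```

The break-even value is the ratio of the two weights, which is why a bedroom that adds more area than this raises the predicted price.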


Exercises


2.1 Linear or not? Determine whether each of the following scalar-valued functions of
n-vectors is linear. If it is a linear function, give its inner product representation, i.e., an
n-vector a for which $f(x) = a^T x$ for all x. If it is not linear, give specific x, y, $\alpha$, and $\beta$
for which superposition fails, i.e.,


$$f(\alpha x + \beta y) \neq \alpha f(x) + \beta f(y).$$


(a) The spread of values of the vector, defined as $f(x) = \max_k x_k - \min_k x_k$.
(b) The difference of the last element and the first, $f(x) = x_n - x_1$.


(c) The median of an n-vector, where we will assume n = 2k +1 is odd. The median of 
the vector x is defined as the (k + 1)st largest number among the entries of x. For 
example, the median of (—7.1, 3.2, —1.5) is —1.5. 


(d) The average of the entries with odd indices, minus the average of the entries with 
even indices. You can assume that n = 2k is even. 


(e) Vector extrapolation, defined as £n + (£n — £n-1), for n > 2. (This is a simple 
prediction of what x41 would be, based on a straight line drawn through £n and 
Ln—1-) 


Processor powers and temperature. The temperature T of an electronic device containing 
three processors is an affine function of the power dissipated by the three processors, 
P = (Pa, P2, P>). When all three processors are idling, we have P = (10,10,10), which 
results in a temperature T = 30. When the first processor operates at full power and 
the other two are idling, we have P = (100, 10,10), and the temperature rises to T = 60. 
When the second processor operates at full power and the other two are idling, we have 
P = (10, 100,10) and T = 70. When the third processor operates at full power and the 
other two are idling, we have P = (10, 10,100) and T = 65. Now suppose that all three 
processors are operated at the same power P**'"°. How large can P**™° be, if we require 
that T < 85? Hint. From the given data, find the 3-vector a and number b for which 
T=a'P+b. 


Motion of a mass in response to applied force. A unit mass moves on a straight line (in 
one dimension). The position of the mass at time t (in seconds) is denoted by s(t), and its 
derivatives (the velocity and acceleration) by s’(t) and s(t). The position as a function 
of time can be determined from Newton’s second law 


s” (t) = F(t), 


where F(t) is the force applied at time t, and the initial conditions s(0), s’(0). We assume 
F(t) is piecewise-constant, and is kept constant in intervals of one second. The sequence 
of forces F(t), for 0 < t < 10, can then be represented by a 10-vector f, with 


F(t)= fe, k-1<t<k. 


Derive expressions for the final velocity s’(10) and final position s(10). Show that s(10) 
and s’(10) are affine functions of x, and give 10-vectors a,c and constants b, d for which 


s'(10) =a" f +b, s(10) = Tf +d. 


This means that the mapping from the applied force sequence to the final position and 
velocity is affine. 


Hint. You can use 


You will find that the mass velocity s’(t) is piecewise-linear. 


2.4 Linear function? The function $\phi : \mathbf{R}^3 \to \mathbf{R}$ satisfies

$$\phi(1, 1, 0) = -1, \qquad \phi(-1, 1, 1) = 1, \qquad \phi(1, -1, -1) = 1.$$

Choose one of the following, and justify your choice: $\phi$ must be linear; $\phi$ could be linear;
$\phi$ cannot be linear.


2.5 Affine function. Suppose $\psi : \mathbf{R}^2 \to \mathbf{R}$ is an affine function, with $\psi(1, 0) = 1$, $\psi(1, -2) = 2$.

(a) What can you say about $\psi(1, -1)$? Either give the value of $\psi(1, -1)$, or state that
it cannot be determined.

(b) What can you say about $\psi(2, -2)$? Either give the value of $\psi(2, -2)$, or state that
it cannot be determined.

Justify your answers.


2.6 Questionnaire scoring. A questionnaire in a magazine has 30 questions, broken into
two sets of 15 questions. Someone taking the questionnaire answers each question with
‘Rarely’, ‘Sometimes’, or ‘Often’. The answers are recorded as a 30-vector $a$, with $a_i = 1, 2, 3$
if question $i$ is answered Rarely, Sometimes, or Often, respectively. The total score
on a completed questionnaire is found by adding up 1 point for every question answered
Sometimes and 2 points for every question answered Often on questions 1-15, and by
adding 2 points and 4 points for those responses on questions 16-30. (Nothing is added to
the score for Rarely responses.) Express the total score $s$ in the form of an affine function
$s = w^T a + v$, where $w$ is a 30-vector and $v$ is a scalar (number).


2.7 General formula for affine functions. Verify that formula (2.4) holds for any affine function
$f : \mathbf{R}^n \to \mathbf{R}$. You can use the fact that $f(x) = a^T x + b$ for some n-vector $a$ and scalar $b$.


2.8 Integral and derivative of polynomial. Suppose the n-vector $c$ gives the coefficients of a
polynomial $p(x) = c_1 + c_2 x + \cdots + c_n x^{n-1}$.

(a) Let $\alpha$ and $\beta$ be numbers with $\alpha < \beta$. Find an n-vector $a$ for which

$$a^T c = \int_\alpha^\beta p(x)\, dx$$

always holds. This means that the integral of a polynomial over an interval is a
linear function of its coefficients.

(b) Let $\alpha$ be a number. Find an n-vector $b$ for which

$$b^T c = p'(\alpha).$$

This means that the derivative of the polynomial at a given point is a linear function
of its coefficients.


2.9 Taylor approximation. Consider the function $f : \mathbf{R}^2 \to \mathbf{R}$ given by $f(x_1, x_2) = x_1 x_2$.
Find the Taylor approximation $\hat f$ at the point $z = (1, 1)$. Compare $f(x)$ and $\hat f(x)$ for the
following values of $x$:

$$x = (1, 1), \qquad x = (1.05, 0.95), \qquad x = (0.85, 1.25), \qquad x = (1, 2).$$

Make a brief comment about the accuracy of the Taylor approximation in each case.


2.10 Regression model. Consider the regression model $\hat y = x^T \beta + v$, where $\hat y$ is the predicted
response, $x$ is an 8-vector of features, $\beta$ is an 8-vector of coefficients, and $v$ is the offset
term. Determine whether each of the following statements is true or false.

(a) If $\beta_3 > 0$ and $x_3 > 0$, then $\hat y > 0$.

(b) If $\beta_2 = 0$ then the prediction $\hat y$ does not depend on the second feature $x_2$.

(c) If $\beta_6 = -0.8$, then increasing $x_6$ (keeping all other $x_i$s the same) will decrease $\hat y$.
2.11 Sparse regression weight vector. Suppose that $x$ is an n-vector that gives n features for
some object, and the scalar $y$ is some outcome associated with the object. What does it
mean if a regression model $\hat y = x^T \beta + v$ uses a sparse weight vector $\beta$? Give your answer
in English, referring to $\hat y$ as our prediction of the outcome.


2.12 Price change to maximize profit. A business sells n products, and is considering changing
the price of one of the products to increase its total profits. A business analyst develops a
regression model that (reasonably accurately) predicts the total profit when the product
prices are changed, given by $\hat P = \beta^T x + P$, where the n-vector $x$ denotes the fractional
change in the product prices, $x_i = (p_i^{\mathrm{new}} - p_i)/p_i$. Here $P$ is the profit with the current
prices, $\hat P$ is the predicted profit with the changed prices, $p_i$ is the current (positive) price
of product $i$, and $p_i^{\mathrm{new}}$ is the new price of product $i$.

(a) What does it mean if $\beta_3 < 0$? (And yes, this can occur.)

(b) Suppose that you are given permission to change the price of one product, by up
to 1%, to increase total profit. Which product would you choose, and would you
increase or decrease the price? By how much?

(c) Repeat part (b) assuming you are allowed to change the price of two products, each
by up to 1%.


Chapter 3


Norm and distance 


In this chapter we focus on the norm of a vector, a measure of its magnitude, and 
on related concepts like distance, angle, standard deviation, and correlation. 


Norm 


The Euclidean norm of an n-vector x (named after the Greek mathematician Euclid),
denoted $\|x\|$, is the squareroot of the sum of the squares of its elements,

$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.$$

The Euclidean norm can also be expressed as the squareroot of the inner product
of the vector with itself, i.e., $\|x\| = \sqrt{x^T x}$.

The Euclidean norm is sometimes written with a subscript 2, as $\|x\|_2$. (The
subscript 2 indicates that the entries of x are raised to the second power.) Other 
less widely used terms for the Euclidean norm of a vector are the magnitude, or 
length, of a vector. (The term length should be avoided, since it is also often used 
to refer to the dimension of the vector.) We use the same notation for the norms 
of vectors of different dimensions. 

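As simple examples, we have

$$\left\| \begin{bmatrix} 2 \\ -1 \\ 2 \end{bmatrix} \right\| = \sqrt{9} = 3, \qquad \left\| \begin{bmatrix} 0 \\ -1 \end{bmatrix} \right\| = 1.$$

When x is a scalar, i.e., a 1-vector, the Euclidean norm is the same as the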
absolute value of x. Indeed, the Euclidean norm can be considered a generalization 
or extension of the absolute value or magnitude, that applies to vectors. The double 
bar notation is meant to suggest this. Like the absolute value of a number, the 
norm of a vector is a (numerical) measure of its magnitude. We say a vector is 
small if its norm is a small number, and we say it is large if its norm is a large 
number. (The numerical values of the norm that qualify for small or large depend 
on the particular application and context.) 
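
The two equivalent expressions for the norm given above can be sketched in a few lines of Python. This is only an illustration using the standard library; the helper names `norm` and `inner` and the test vector are assumptions made for this example.

```python
import math

def inner(a, b):
    """Inner product of two vectors of the same size."""
    assert len(a) == len(b)
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(x):
    """Euclidean norm: squareroot of the sum of the squares of the entries."""
    return math.sqrt(sum(xi * xi for xi in x))

x = [2.0, -1.0, 2.0]
print(norm(x))                     # the sum of squares is 9
print(math.sqrt(inner(x, x)))      # the same value, via the inner product
print(norm([-7.0]) == abs(-7.0))   # for a 1-vector the norm is the absolute value
```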




Properties of norm. Some important properties of the Euclidean norm are given
below. Here x and y are vectors of the same size, and $\beta$ is a scalar.

• Nonnegative homogeneity. $\|\beta x\| = |\beta| \, \|x\|$. Multiplying a vector by a scalar
multiplies the norm by the absolute value of the scalar.

• Triangle inequality. $\|x + y\| \leq \|x\| + \|y\|$. The Euclidean norm of a sum of two
vectors is no more than the sum of their norms. (The name of this property
will be explained later.) Another name for this inequality is subadditivity.

• Nonnegativity. $\|x\| \geq 0$.

• Definiteness. $\|x\| = 0$ only if $x = 0$.

The last two properties together, which state that the norm is always nonnegative,
and zero only when the vector is zero, are called positive definiteness. The first,
third, and fourth properties are easy to show directly from the definition of the
norm. As an example, let’s verify the definiteness property. If $\|x\| = 0$, then
we also have $\|x\|^2 = 0$, which means that $x_1^2 + \cdots + x_n^2 = 0$. This is a sum of n
nonnegative numbers, which is zero. We can conclude that each of the n numbers is
zero, since if any of them were nonzero the sum would be positive. So we conclude
that $x_i^2 = 0$ for $i = 1, \ldots, n$, and therefore $x_i = 0$ for $i = 1, \ldots, n$; and thus, $x = 0$.
Establishing the second property, the triangle inequality, is not as easy; we will
give a derivation on page 57.


General norms. Any real-valued function of an n-vector that satisfies the four 
properties listed above is called a (general) norm. But in this book we will only 
use the Euclidean norm, so from now on, we refer to the Euclidean norm as the 
norm. (See exercise 3.5, which describes some other useful norms.) 


Root-mean-square value. The norm is related to the root-mean-square (RMS)
value of an n-vector x, defined as

$$\mathbf{rms}(x) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|x\|}{\sqrt{n}}.$$

The argument of the squareroot in the middle expression is called the mean square
value of x, denoted ms(x), and the RMS value is the squareroot of the mean square
value. The RMS value of a vector x is useful when comparing norms of vectors
with different dimensions; the RMS value tells us what a ‘typical’ value of $|x_i|$ is.
For example, the norm of $\mathbf{1}$, the n-vector of all ones, is $\sqrt{n}$, but its RMS value is 1,
independent of n. More generally, if all the entries of a vector are the same, say,
$\alpha$, then the RMS value of the vector is $|\alpha|$.
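
The contrast between the norm and the RMS value of the ones vector can be seen in a short Python sketch. The helper names `norm` and `rms` and the chosen dimensions are assumptions made for illustration.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of numbers."""
    return math.sqrt(sum(xi * xi for xi in x))

def rms(x):
    """Root-mean-square value: squareroot of the mean of the squared entries."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

for n in (4, 100, 10000):
    ones = [1.0] * n
    # The norm grows like sqrt(n), while the RMS value stays at 1.
    print(n, norm(ones), rms(ones))
```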


Norm of a sum. A useful formula for the norm of the sum of two vectors x and
y is

$$\|x + y\| = \sqrt{\|x\|^2 + 2 x^T y + \|y\|^2}. \tag{3.1}$$

To derive this formula, we start with the square of the norm of x + y and use various
properties of the inner product:

$$\begin{aligned}
\|x + y\|^2 &= (x + y)^T (x + y) \\
&= x^T x + x^T y + y^T x + y^T y \\
&= \|x\|^2 + 2 x^T y + \|y\|^2.
\end{aligned}$$

Taking the squareroot of both sides yields the formula (3.1) above. In the first
line, we use the definition of the norm. In the second line, we expand the inner
product. In the third line we use the definition of the norm, and the fact that
$x^T y = y^T x$. Some other identities relating norms, sums, and inner products of
vectors are explored in exercise 3.4.
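
Formula (3.1) is easy to check numerically. The following Python sketch compares both sides for two arbitrary 3-vectors; the vectors and helper names are assumptions chosen only for illustration.

```python
import math

def inner(a, b):
    """Inner product of two vectors of the same size."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(x):
    """Euclidean norm, computed from the inner product of x with itself."""
    return math.sqrt(inner(x, x))

x = [1.0, -2.0, 3.0]
y = [4.0, 0.5, 1.0]

lhs = norm([xi + yi for xi, yi in zip(x, y)])
rhs = math.sqrt(norm(x) ** 2 + 2 * inner(x, y) + norm(y) ** 2)
print(lhs, rhs)   # the two values agree up to rounding
```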


Norm of block vectors. The norm-squared of a stacked vector is the sum of the
norm-squared values of its subvectors. For example, with $d = (a, b, c)$ (where a, b,
and c are vectors), we have

$$\|d\|^2 = d^T d = a^T a + b^T b + c^T c = \|a\|^2 + \|b\|^2 + \|c\|^2.$$

This idea is often used in reverse, to express the sum of the norm-squared values
of some vectors as the norm-squared value of a block vector formed from them.
We can write the equality above in terms of norms as

$$\|(a, b, c)\| = \sqrt{\|a\|^2 + \|b\|^2 + \|c\|^2} = \| (\|a\|, \|b\|, \|c\|) \|.$$

In words: The norm of a stacked vector is the norm of the vector formed from
the norms of the subvectors. The right-hand side of the equation above should be
carefully read. The outer norm symbols enclose a 3-vector, with (scalar) entries
$\|a\|$, $\|b\|$, and $\|c\|$.
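
As a quick illustration of the block vector identity, the following Python sketch stacks three small vectors (arbitrary values, assumed for this example) and compares the norm of the stacked vector with the norm of the 3-vector of subvector norms.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of numbers."""
    return math.sqrt(sum(xi * xi for xi in x))

a = [1.0, 2.0]
b = [-3.0]
c = [0.5, 0.5, 4.0]

d = a + b + c                                   # list concatenation stacks the vectors
norm_of_stack = norm(d)
norm_of_norms = norm([norm(a), norm(b), norm(c)])
print(norm_of_stack, norm_of_norms)             # equal up to rounding
```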


Chebyshev inequality. Suppose that x is an n-vector, and that k of its entries
satisfy $|x_i| \geq a$, where $a > 0$. Then k of its entries satisfy $x_i^2 \geq a^2$. It follows that

$$\|x\|^2 = x_1^2 + \cdots + x_n^2 \geq k a^2,$$

since k of the numbers in the sum are at least $a^2$, and the other $n - k$ numbers are
nonnegative. We can conclude that $k \leq \|x\|^2 / a^2$, which is called the Chebyshev
inequality, after the mathematician Pafnuty Chebyshev. When $\|x\|^2 / a^2 \geq n$, the
inequality tells us nothing, since we always have $k \leq n$. In other cases it limits
the number of entries in a vector that can be large. For $a > \|x\|$, the inequality is
$k \leq \|x\|^2 / a^2 < 1$, so we conclude that $k = 0$ (since k is an integer). In other words,
no entry of a vector can be larger in magnitude than the norm of the vector.

The Chebyshev inequality is easier to interpret in terms of the RMS value of a
vector. We can write it as

$$\frac{k}{n} \leq \left( \frac{\mathbf{rms}(x)}{a} \right)^2, \tag{3.2}$$

where k is, as above, the number of entries of x with absolute value at least a. The
left-hand side is the fraction of entries of the vector that are at least a in absolute
value. The right-hand side is the inverse square of the ratio of a to rms(x). It says,
for example, that no more than 1/25 = 4% of the entries of a vector can exceed
its RMS value by more than a factor of 5. The Chebyshev inequality partially
justifies the idea that the RMS value of a vector gives an idea of the size of a
typical entry: It states that not too many of the entries of a vector can be much
bigger (in absolute value) than its RMS value. (A converse statement can also be
made: At least one entry of a vector has absolute value as large as the RMS value
of the vector; see exercise 3.8.)


Figure 3.1 The norm of the displacement $b - a$ is the distance between the
points with coordinates a and b.
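
The RMS form of the Chebyshev inequality can be illustrated with a small Python sketch. The 10-vector below and the threshold a = 3 are made-up values chosen for the example; `chebyshev_bound` is a hypothetical helper name.

```python
import math

def rms(x):
    """Root-mean-square value of a vector given as a list of numbers."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

def chebyshev_bound(x, a):
    """Upper bound (rms(x)/a)^2 on the fraction of entries with |x_i| >= a."""
    return (rms(x) / a) ** 2

x = [0.1, -0.2, 5.0, 0.3, -0.1, 0.2, -4.0, 0.1, 0.0, 0.2]
a = 3.0

# Actual fraction of entries whose absolute value is at least a.
fraction = sum(1 for xi in x if abs(xi) >= a) / len(x)
print(fraction, chebyshev_bound(x, a))   # the fraction does not exceed the bound
```

The bound is often loose, as it is here; it guarantees only that the fraction of large entries cannot exceed the right-hand side.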


Distance 


Euclidean distance. We can use the norm to define the Euclidean distance between
two vectors a and b as the norm of their difference:

$$\mathbf{dist}(a, b) = \|a - b\|.$$

For one, two, and three dimensions, this distance is exactly the usual distance
between points with coordinates a and b, as illustrated in figure 3.1. But the
Euclidean distance is defined for vectors of any dimension; we can refer to the
distance between two vectors of dimension 100. Since we only use the Euclidean
norm in this book, we will refer to the Euclidean distance between vectors as,
simply, the distance between the vectors. If a and b are n-vectors, we refer to the
RMS value of the difference, $\|a - b\| / \sqrt{n}$, as the RMS deviation between the two
vectors.

When the distance between two n-vectors x and y is small, we say they are
‘close’ or ‘nearby’, and when the distance $\|x - y\|$ is large, we say they are ‘far’.
The particular numerical values of $\|x - y\|$ that correspond to ‘close’ or ‘far’ depend
on the particular application.


Figure 3.2 Triangle inequality.


As an example, consider the 4-vectors

$$u = \begin{bmatrix} 1.8 \\ 2.0 \\ -3.7 \\ 4.7 \end{bmatrix}, \qquad
v = \begin{bmatrix} 0.6 \\ 2.1 \\ 1.9 \\ -1.4 \end{bmatrix}, \qquad
w = \begin{bmatrix} 2.0 \\ 1.9 \\ -4.0 \\ 4.6 \end{bmatrix}.$$

The distances between pairs of them are

$$\|u - v\| = 8.368, \qquad \|u - w\| = 0.387, \qquad \|v - w\| = 8.533,$$


so we can say that u is much nearer (or closer) to w than it is to v. We can also 
say that w is much nearer to u than it is to v. 
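
The three distances quoted above can be reproduced with a short Python sketch. The function name `dist` is an assumption made for this illustration; the vectors are the ones given in the example.

```python
import math

def dist(a, b):
    """Euclidean distance: the norm of the difference a - b."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

u = [1.8, 2.0, -3.7, 4.7]
v = [0.6, 2.1, 1.9, -1.4]
w = [2.0, 1.9, -4.0, 4.6]

# Distances rounded to three decimal places, as in the text.
print(round(dist(u, v), 3), round(dist(u, w), 3), round(dist(v, w), 3))
```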


Triangle inequality. We can now explain where the triangle inequality gets its
name. Consider a triangle in two or three dimensions, whose vertices have coordinates
a, b, and c. The lengths of the sides are the distances between the vertices,

$$\mathbf{dist}(a, b) = \|a - b\|, \qquad \mathbf{dist}(b, c) = \|b - c\|, \qquad \mathbf{dist}(a, c) = \|a - c\|.$$

Geometric intuition tells us that the length of any side of a triangle cannot exceed
the sum of the lengths of the other two sides. For example, we have

$$\|a - c\| \leq \|a - b\| + \|b - c\|. \tag{3.3}$$

This follows from the triangle inequality, since

$$\|a - c\| = \|(a - b) + (b - c)\| \leq \|a - b\| + \|b - c\|.$$

This is illustrated in figure 3.2.


Figure 3.3 A point x, shown as a square, and six other points $z_1, \ldots, z_6$.
The point $z_3$ is the nearest neighbor of x among the points $z_1, \ldots, z_6$.


Examples. 


• Feature distance. If x and y represent vectors of n features of two objects,
the quantity $\|x - y\|$ is called the feature distance, and gives a measure of
how different the objects are (in terms of their feature values). Suppose for
example the feature vectors are associated with patients in a hospital, with
entries such as weight, age, presence of chest pain, difficulty breathing, and
the results of tests. We can use feature vector distance to say that one patient
case is near another one (at least in terms of their feature vectors).

• RMS prediction error. Suppose that the n-vector y represents a time series
of some quantity, for example, hourly temperature at some location, and $\hat y$ is
another n-vector that represents an estimate or prediction of the time series y,
based on other information. The difference $y - \hat y$ is called the prediction error,
and its RMS value $\mathbf{rms}(y - \hat y)$ is called the RMS prediction error. If this value
is small (say, compared to rms(y)) the prediction is good.

• Nearest neighbor. Suppose $z_1, \ldots, z_m$ is a collection of m n-vectors, and that
x is another n-vector. We say that $z_j$ is the nearest neighbor of x (among
$z_1, \ldots, z_m$) if

$$\|x - z_j\| \leq \|x - z_i\|, \qquad i = 1, \ldots, m.$$

In words: $z_j$ is the closest vector to x among the vectors $z_1, \ldots, z_m$. This
is illustrated in figure 3.3. The idea of nearest neighbor, and generalizations
such as the k-nearest neighbors, are used in many applications.

• Document dissimilarity. Suppose n-vectors x and y represent the histograms
of word occurrences for two documents. Then $\|x - y\|$ represents a measure
of the dissimilarity of the two documents. We might expect the dissimilarity
to be smaller when the two documents have the same genre, topic, or author;
we would expect it to be larger when they are on different topics, or have
different authors. As an example we form the word count histograms for the
5 Wikipedia articles with titles ‘Veterans Day’, ‘Memorial Day’, ‘Academy
Awards’, ‘Golden Globe Awards’, and ‘Super Bowl’, using a dictionary of
4423 words. (More detail is given in §4.4.) The pairwise distances between
the word count histograms are shown in table 3.1. We can see that pairs of
related articles have smaller word count histogram distances than less related
pairs of articles.


                   Veterans   Memorial   Academy   Golden Globe   Super
                   Day        Day        Awards    Awards         Bowl

Veterans Day       0          0.095      0.130     0.153          0.170
Memorial Day       0.095      0          0.122     0.147          0.164
Academy A.         0.130      0.122      0         0.108          0.164
Golden Globe A.    0.153      0.147      0.108     0              0.181
Super Bowl         0.170      0.164      0.164     0.181          0

Table 3.1 Pairwise word count histogram distances between five Wikipedia
articles.
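
The nearest neighbor definition in the list above translates directly into code. The following Python sketch is one simple way to do it, by exhaustive search; the six 2-vectors and the point x are hypothetical values chosen for illustration, and the function name `nearest_neighbor` is an assumption.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors of the same size."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor(x, zs):
    """Return the (0-based) index j of the vector in zs closest to x.

    Ties are broken in favor of the smallest index.
    """
    return min(range(len(zs)), key=lambda j: dist(x, zs[j]))

# Hypothetical 2-vectors, chosen only for illustration.
zs = [[2.0, 1.0], [7.0, 2.0], [5.5, 4.0], [0.5, 3.5], [4.0, 8.0], [5.3, 4.8]]
x = [5.0, 6.0]

j = nearest_neighbor(x, zs)
print(j, zs[j], dist(x, zs[j]))
```

Exhaustive search computes all m distances, which is fine for small collections; other search strategies exist for large ones but are outside the scope of this sketch.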


Units for heterogeneous vector entries. The square of the distance between two 
n-vectors x and y is given by 


$$\|x - y\|^2 = (x_1 - y_1)^2 + \cdots + (x_n - y_n)^2,$$


the sum of the squares of the differences between their respective entries. Roughly
speaking, the entries in the vectors all have equal status in determining the distance
between them. For example, if $x_2$ and $y_2$ differ by one, the contribution to the
square of the distance between them is the same as the contribution when $x_3$ and
$y_3$ differ by one. This makes sense when the entries of the vectors x and y represent
the same type of quantity, using the same units (say, at different times or locations), 
for example meters or dollars. For example if x and y are word count histograms, 
their entries are all word occurrence frequencies, and it makes sense to say they 
are close when their distance is small. 

When the entries of a vector represent different types of quantities, for example 
when the vector entries represent different types of features associated with an 
object, we must be careful about choosing the units used to represent the numerical 
values of the entries. If we want the different entries to have approximately equal 
status in determining distance, their numerical values should be approximately of 
the same magnitude. For this reason units for different entries in vectors are often 
chosen in such a way that their typical numerical values are similar in magnitude, 
so that the different entries play similar roles in determining distance. 

As an example suppose that the 2-vectors x, y, and z are the feature vectors
for three houses that were sold, as in the example described on page 39. The first
entry of each vector gives the house area and the second entry gives the number of
bedrooms. These are very different types of features, since the first one is a physical
area, and the second one is a count, i.e., an integer. In the example on page 39, we 
chose the unit used to represent the first feature, area, to be thousands of square 
feet. With this choice of unit used to represent house area, the numerical values of 
both of these features range from around 1 to 5; their values have roughly the same 
magnitude. When we determine the distance between feature vectors associated 
with two houses, the difference in the area (in thousands of square feet), and the 
difference in the number of bedrooms, play equal roles. 
For example, consider three houses with feature vectors 


x = (1.6,2), y=(1.5,2), z= (1.6,4). 


The first two are ‘close’ or ‘similar’ since $\|x - y\| = 0.1$ is small (compared to the
norms of x and y, which are around 2.5). This matches our intuition that the first 
two houses are similar, since they both have two bedrooms and are close in area. 
The third house would be considered ‘far’ or ‘different’ from the first two houses, 
and rightly so since it has four bedrooms instead of two. 

To appreciate the significance of our choice of units in this example, suppose 
we had chosen instead to represent house area directly in square feet, and not 
thousands of square feet. The three houses above would then be represented by 
feature vectors 


$$\tilde x = (1600, 2), \qquad \tilde y = (1500, 2), \qquad \tilde z = (1600, 4).$$


The distance between the first and third houses is now 2, which is very small 
compared to the norms of the vectors (which are around 1600). The distance 
between the first and second houses is much larger. It seems strange to consider 
a two-bedroom house and a four-bedroom house as ‘very close’, while two houses 
with the same number of bedrooms and similar areas are much more dissimilar. 
The reason is simple: With our choice of square feet as the unit to measure house 
area, distances are very strongly influenced by differences in area, with number of 
bedrooms playing a much smaller (relative) role. 
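
The effect of the choice of units on distance can be seen directly in a short Python sketch that uses the three house feature vectors from the text. The helper name `dist` and the variable names are assumptions made for this illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors of the same size."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Feature vectors (area, bedrooms), with area in thousands of square feet.
x, y, z = (1.6, 2), (1.5, 2), (1.6, 4)
print(dist(x, y), dist(x, z))        # the first house is much closer to the second

# The same three houses, with area in square feet.
xs, ys, zs = (1600, 2), (1500, 2), (1600, 4)
print(dist(xs, ys), dist(xs, zs))    # now the first house appears closer to the third
```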


Standard deviation 


For any vector x, the vector $\tilde x = x - \mathbf{avg}(x) \mathbf{1}$ is called the associated de-meaned
vector, obtained by subtracting from each entry of x the mean value of the entries.
(This is not standard notation; i.e., $\tilde x$ is not generally used to denote the de-meaned
vector.) The mean value of the entries of $\tilde x$ is zero, i.e., $\mathbf{avg}(\tilde x) = 0$. This explains
why $\tilde x$ is called the de-meaned version of x; it is x with its mean removed. The
de-meaned vector is useful for understanding how the entries of a vector deviate
from their mean value. It is zero if all the entries in the original vector x are the
same.

The standard deviation of an n-vector x is defined as the RMS value of the
de-meaned vector $x - \mathbf{avg}(x) \mathbf{1}$, i.e.,

$$\mathbf{std}(x) = \sqrt{\frac{(x_1 - \mathbf{avg}(x))^2 + \cdots + (x_n - \mathbf{avg}(x))^2}{n}}.$$

This is the same as the RMS deviation between a vector x and the vector all of
whose entries are avg(x). It can be written using the inner product and norm as

$$\mathbf{std}(x) = \frac{\| x - (\mathbf{1}^T x / n) \mathbf{1} \|}{\sqrt{n}}. \tag{3.4}$$

The standard deviation of a vector x tells us the typical amount by which its entries
deviate from their average value. The standard deviation of a vector is zero only
when all its entries are equal. The standard deviation of a vector is small when the
entries of the vector are nearly the same.

As a simple example consider the vector $x = (1, -2, 3, 2)$. Its mean or average
value is $\mathbf{avg}(x) = 1$, so the de-meaned vector is $\tilde x = (0, -3, 2, 1)$. Its standard
deviation is $\mathbf{std}(x) = \sqrt{14/4} = 1.871$. We interpret this number as a ‘typical’ value by which
the entries differ from the mean of the entries. These numbers are 0, 3, 2, and 1,
so 1.871 is reasonable.

We should warn the reader that another slightly different definition of the standard
deviation of a vector is widely used, in which the denominator $\sqrt{n}$ in (3.4) is
replaced with $\sqrt{n - 1}$ (for $n \geq 2$). In this book we will only use the definition (3.4).

In some applications the Greek letter $\sigma$ (sigma) is traditionally used to denote
standard deviation, while the mean is denoted $\mu$ (mu). In this notation we have,
for an n-vector x,

$$\mu = \mathbf{1}^T x / n, \qquad \sigma = \| x - \mu \mathbf{1} \| / \sqrt{n}.$$

We will use the symbols avg(x) and std(x), switching to $\mu$ and $\sigma$ only with explanation,
when describing an application that traditionally uses these symbols.
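
The definition of the standard deviation as the RMS value of the de-meaned vector can be sketched in Python as follows, using the example vector (1, -2, 3, 2) from the text. The helper names `avg`, `rms`, and `std` mirror the notation of this section but are otherwise assumptions of this sketch.

```python
import math

def avg(x):
    """Mean value of the entries of x."""
    return sum(x) / len(x)

def rms(x):
    """Root-mean-square value of x."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

def std(x):
    """Standard deviation with denominator sqrt(n): RMS value of the de-meaned vector."""
    mean = avg(x)
    return rms([xi - mean for xi in x])

x = [1.0, -2.0, 3.0, 2.0]
demeaned = [xi - avg(x) for xi in x]
print(avg(x), demeaned, std(x))    # std(x) equals sqrt(14/4) = sqrt(3.5)
```

In Python's standard library, `statistics.pstdev` uses the same denominator n, while `statistics.stdev` uses n - 1, which is the alternative definition mentioned above.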


Average, RMS value, and standard deviation. The average, RMS value, and 
standard deviation of a vector are related by the formula 


$$\mathbf{rms}(x)^2 = \mathbf{avg}(x)^2 + \mathbf{std}(x)^2. \tag{3.5}$$


This formula makes sense: $\mathbf{rms}(x)^2$ is the mean square value of the entries of x,
which can be expressed as the square of the mean value, plus the mean square
fluctuation of the entries of x around their mean value. We can derive this formula
from our vector notation formula for std(x) given above. We have

$$\begin{aligned}
\mathbf{std}(x)^2 &= (1/n) \| x - (\mathbf{1}^T x / n) \mathbf{1} \|^2 \\
&= (1/n) \left( x^T x - 2 x^T (\mathbf{1}^T x / n) \mathbf{1} + ((\mathbf{1}^T x / n) \mathbf{1})^T ((\mathbf{1}^T x / n) \mathbf{1}) \right) \\
&= (1/n) \left( x^T x - (2/n) (\mathbf{1}^T x)^2 + n (\mathbf{1}^T x / n)^2 \right) \\
&= (1/n) x^T x - (\mathbf{1}^T x / n)^2 \\
&= \mathbf{rms}(x)^2 - \mathbf{avg}(x)^2,
\end{aligned}$$

which can be re-arranged to obtain the identity (3.5) above. This derivation uses
many of the properties for norms and inner products, and should be read carefully
to understand every step. In the second line, we expand the norm-square of the
sum of two vectors. In the third line, we use the commutative property of scalar-
vector multiplication, moving scalars such as $(\mathbf{1}^T x / n)$ to the front of each term,
and also the fact that $\mathbf{1}^T \mathbf{1} = n$.
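
The relation between the average, RMS value, and standard deviation can be checked numerically. The Python sketch below does this for an arbitrary 5-vector (an assumed example); the helper names follow the notation of this section.

```python
import math

def avg(x):
    """Mean value of the entries of x."""
    return sum(x) / len(x)

def rms(x):
    """Root-mean-square value of x."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

def std(x):
    """Standard deviation: RMS value of the de-meaned vector."""
    mean = avg(x)
    return rms([xi - mean for xi in x])

x = [0.5, -1.2, 3.3, 2.0, 0.7]
lhs = rms(x) ** 2
rhs = avg(x) ** 2 + std(x) ** 2
print(lhs, rhs)   # the two sides agree up to rounding
```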
Examples.


• Mean return and risk. Suppose that an n-vector represents a time series of
return on an investment, expressed as a percentage, in n time periods over 
some interval of time. Its average gives the mean return over the whole 
interval, often shortened to its return. Its standard deviation is a measure of 
how variable the return is, from period to period, over the time interval, i.e., 
how much it typically varies from its mean, and is often called the (per period) 
risk of the investment. Multiple investments can be compared by plotting 
them on a risk-return plot, which gives the mean and standard deviation of 
the returns of each of the investments over some interval. A desirable return 
history vector has high mean return and low risk; this means that the returns 
in the different periods are consistently high. Figure 3.4 shows an example. 


• Temperature or rainfall. Suppose that an n-vector is a time series of the
daily average temperature at a particular location, over a one year period. 
Its average gives the average temperature at that location (over the year) and 
its standard deviation is a measure of how much the temperature varied from 
its average value. We would expect the average temperature to be high and 
the standard deviation to be low in a tropical location, and the opposite for 
a location with high latitude. 


Chebyshev inequality for standard deviation. The Chebyshev inequality (3.2)
can be transcribed to an inequality expressed in terms of the mean and standard
deviation: If k is the number of entries of x that satisfy $|x_i - \mathbf{avg}(x)| \geq a$, then
$k/n \leq (\mathbf{std}(x)/a)^2$. (This inequality is only interesting for $a > \mathbf{std}(x)$.) For
example, at most 1/9 = 11.1% of the entries of a vector can deviate from the mean
value avg(x) by 3 standard deviations or more. Another way to state this is: The
fraction of entries of x within $\alpha$ standard deviations of avg(x) is at least $1 - 1/\alpha^2$
(for $\alpha > 1$).

As an example, consider a time series of return on an investment, with a mean
return of 8%, and a risk (standard deviation) 3%. By the Chebyshev inequality,
the fraction of periods with a loss (i.e., $x_i \leq 0$) is no more than $(3/8)^2 = 14.1\%$.
(In fact, the fraction of periods when the return is either a loss, $x_i \leq 0$, or very
good, $x_i \geq 16\%$, is together no more than 14.1%.)
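
The two percentages quoted above follow from the same one-line bound. Here is a minimal Python sketch; the function name `chebyshev_std_bound` is hypothetical, and capping the bound at 1 is a convenience added here, since a fraction cannot exceed 1.

```python
def chebyshev_std_bound(std_value, a):
    """Bound on the fraction of entries with |x_i - avg(x)| >= a.

    Returns (std(x)/a)^2, capped at 1.
    """
    return min(1.0, (std_value / a) ** 2)

# Investment example: mean return 8%, risk 3%. A loss means the return is
# at least 8 percentage points below the mean.
loss_bound = chebyshev_std_bound(3.0, 8.0)

# Entries deviating from the mean by 3 standard deviations or more.
three_sigma_bound = chebyshev_std_bound(1.0, 3.0)

print(round(100 * loss_bound, 1), round(100 * three_sigma_bound, 1))
```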


Properties of standard deviation. 


• Adding a constant. For any vector x and any number a, we have $\mathbf{std}(x + a\mathbf{1}) = \mathbf{std}(x)$.
Adding a constant to every entry of a vector does not change its
standard deviation.

• Multiplying by a scalar. For any vector x and any number a, we have
$\mathbf{std}(ax) = |a| \, \mathbf{std}(x)$. Multiplying a vector by a scalar multiplies the standard
deviation by the absolute value of the scalar.


Figure 3.4 The vectors a, b, c, d represent time series of returns on investments
over 10 periods. The bottom plot shows the investments in a
risk-return plane, with return defined as the average value and risk as the
standard deviation of the corresponding vector.


Figure 3.5 A 10-vector x, the de-meaned vector $\tilde x = x - \mathbf{avg}(x)\mathbf{1}$, and the
standardized vector $z = (1/\mathbf{std}(x)) \tilde x$. The horizontal dashed lines indicate
the mean and the standard deviation of each vector. The middle line is
the mean; the distance between the middle line and the other two is the
standard deviation.


Standardization. For any vector x, we refer to $\tilde x = x - \mathbf{avg}(x)\mathbf{1}$ as the de-meaned
version of x, since it has average or mean value zero. If we then divide by the RMS
value of $\tilde x$ (which is the standard deviation of x), we obtain the vector

$$z = \frac{1}{\mathbf{std}(x)} (x - \mathbf{avg}(x) \mathbf{1}).$$

This vector is called the standardized version of x. It has mean zero, and standard
deviation one. Its entries are sometimes called the z-scores associated with the
original entries of x. For example, $z_4 = 1.4$ means that $x_4$ is 1.4 standard deviations
above the mean of the entries of x. Figure 3.5 shows an example.

The standardized values for a vector give a simple way to interpret the original
values in the vectors. For example, if an n-vector x gives the values of some
medical test of n patients admitted to a hospital, the standardized values or z-
scores tell us how high or low, compared to the population, that patient’s value is.
A value $z_6 = -3.2$, for example, means that patient 6 has a very low value of the
measurement; whereas $z_{22} = 0.3$ says that patient 22’s value is quite close to the
average value.
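
Standardization can be sketched in Python as follows. The function name `standardize` and the test vector are assumptions made for this example; the explicit check for a zero standard deviation is a design choice of the sketch, since the standardized vector is not defined when all entries are equal.

```python
import math

def avg(x):
    """Mean value of the entries of x."""
    return sum(x) / len(x)

def std(x):
    """Standard deviation with denominator sqrt(n)."""
    mean = avg(x)
    return math.sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))

def standardize(x):
    """Return the z-scores of x. Requires std(x) > 0."""
    mean, s = avg(x), std(x)
    if s == 0:
        raise ValueError("cannot standardize a vector whose entries are all equal")
    return [(xi - mean) / s for xi in x]

x = [1.0, -2.0, 3.0, 2.0]
z = standardize(x)
print(z, avg(z), std(z))   # mean close to 0 and standard deviation close to 1
```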


Angle 


Cauchy-Schwarz inequality. An important inequality that relates norms and inner
products is the Cauchy-Schwarz inequality:

$$|a^T b| \leq \|a\| \, \|b\|$$

for any n-vectors a and b. Written out in terms of the entries, this is

$$|a_1 b_1 + \cdots + a_n b_n| \leq \left( a_1^2 + \cdots + a_n^2 \right)^{1/2} \left( b_1^2 + \cdots + b_n^2 \right)^{1/2},$$

which looks more intimidating. This inequality is attributed to the mathematician
Augustin-Louis Cauchy; Hermann Schwarz gave the derivation given below.

The Cauchy-Schwarz inequality can be shown as follows. The inequality clearly
holds if $a = 0$ or $b = 0$ (in this case, both sides of the inequality are zero). So we
suppose now that $a \neq 0$, $b \neq 0$, and define $\alpha = \|a\|$, $\beta = \|b\|$. We observe that

$$
\begin{aligned}
0 &\le \|\beta a - \alpha b\|^2 \\
  &= \|\beta a\|^2 - 2(\beta a)^T(\alpha b) + \|\alpha b\|^2 \\
  &= \beta^2 \|a\|^2 - 2\beta\alpha\,(a^T b) + \alpha^2 \|b\|^2 \\
  &= \|b\|^2 \|a\|^2 - 2\|b\|\,\|a\|\,(a^T b) + \|a\|^2 \|b\|^2 \\
  &= 2\|a\|^2 \|b\|^2 - 2\|a\|\,\|b\|\,(a^T b).
\end{aligned}
$$

Dividing by $2\|a\|\,\|b\|$ yields $a^T b \le \|a\|\,\|b\|$. Applying this inequality to $-a$ and
$b$ we obtain $-a^T b \le \|a\|\,\|b\|$. Putting these two inequalities together we get the
Cauchy-Schwarz inequality, $|a^T b| \le \|a\|\,\|b\|$.

This argument also reveals the conditions on $a$ and $b$ under which they satisfy
the Cauchy-Schwarz inequality with equality. This occurs only if $\|\beta a - \alpha b\| = 0$,
i.e., $\beta a = \alpha b$. This means that each vector is a scalar multiple of the other (in
the case when they are nonzero). This statement remains true when either $a$ or
$b$ is zero. So the Cauchy-Schwarz inequality holds with equality when one of the
vectors is a multiple of the other; in all other cases, it holds with strict inequality.
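
As a quick numerical illustration (not a proof), the sketch below checks the inequality for one pair of vectors that are not multiples of each other, and the equality case for a pair that are. The helper names `inner` and `norm` and the specific vectors are assumptions made for this example.

```python
import math

def inner(a, b):
    """Inner product a^T b of two vectors of the same size."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean norm of a."""
    return math.sqrt(inner(a, a))

a = [1.0, 2.0, -1.0]
b = [2.0, 0.0, -3.0]
lhs = abs(inner(a, b))          # |a^T b| = 5
rhs = norm(a) * norm(b)         # ||a|| ||b|| = sqrt(6) * sqrt(13)
gap = rhs - lhs                 # strictly positive: a and b are not multiples

c = [-2.0 * ai for ai in a]     # c = -2a, a multiple of a
gap_eq = norm(a) * norm(c) - abs(inner(a, c))   # zero, up to rounding error
```

In floating point the equality case gives a gap that is zero only up to rounding, so any such check should compare against a small tolerance rather than exact zero.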


Verification of triangle inequality. We can use the Cauchy-Schwarz inequality to
verify the triangle inequality. Let $a$ and $b$ be any vectors. Then

$$
\begin{aligned}
\|a + b\|^2 &= \|a\|^2 + 2a^T b + \|b\|^2 \\
            &\le \|a\|^2 + 2\|a\|\,\|b\| + \|b\|^2 \\
            &= \left(\|a\| + \|b\|\right)^2,
\end{aligned}
$$

where we used the Cauchy-Schwarz inequality in the second line. Taking the
squareroot we get the triangle inequality, $\|a + b\| \le \|a\| + \|b\|$.


Angle between vectors. The angle between two nonzero vectors $a$, $b$ is defined as

$$ \theta = \arccos\left(\frac{a^T b}{\|a\|\,\|b\|}\right), $$

where arccos denotes the inverse cosine, normalized to lie in the interval $[0, \pi]$. In
other words, we define $\theta$ as the unique number between 0 and $\pi$ that satisfies

$$ a^T b = \|a\|\,\|b\| \cos\theta. $$

The angle between $a$ and $b$ is written as $\angle(a,b)$, and is sometimes expressed in
degrees. (The default angle unit is radians; $360^\circ$ is $2\pi$ radians.) For example,
$\angle(a,b) = 60^\circ$ means $\angle(a,b) = \pi/3$, i.e., $a^T b = (1/2)\|a\|\,\|b\|$.

The angle coincides with the usual notion of angle between vectors, when they
have dimension two or three, and they are thought of as displacements from a
common point. For example, the angle between the vectors $a = (1, 2, -1)$ and
$b = (2, 0, -3)$ is

$$ \arccos\left(\frac{5}{\sqrt{6}\,\sqrt{13}}\right) = \arccos(0.5661) = 0.9690 = 55.52^\circ $$

(to 4 digits). But the definition of angle is more general; we can refer to the angle
between two vectors with dimension 100.
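
The worked example above can be reproduced with a short Python sketch. The function name `angle` is an assumption for illustration; the clamping step is a defensive choice added here, since rounding can push the computed cosine slightly outside $[-1, 1]$.

```python
import math

def angle(a, b):
    """Angle in radians between nonzero vectors a and b of the same size."""
    ip = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    c = ip / (na * nb)
    c = max(-1.0, min(1.0, c))   # guard against rounding outside [-1, 1]
    return math.acos(c)          # math.acos returns a value in [0, pi]

theta = angle([1, 2, -1], [2, 0, -3])
theta_deg = math.degrees(theta)
# theta is about 0.9690 radians, i.e., about 55.52 degrees, as in the text.
```

The same function applies unchanged to vectors of any dimension, for example two 100-vectors.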

The angle is a symmetric function of $a$ and $b$: We have $\angle(a,b) = \angle(b,a)$. The
angle is not affected by scaling each of the vectors by a positive scalar: We have,
for any vectors $a$ and $b$, and any positive numbers $\alpha$ and $\beta$,

$$ \angle(\alpha a, \beta b) = \angle(a, b). $$


Acute and obtuse angles. Angles are classified according to the sign of $a^T b$.
Suppose $a$ and $b$ are nonzero vectors of the same size.

• If the angle is $\pi/2 = 90^\circ$, i.e., $a^T b = 0$, the vectors are said to be orthogonal.
  We write $a \perp b$ if $a$ and $b$ are orthogonal. (By convention, we also say that a
  zero vector is orthogonal to any vector.)

• If the angle is zero, which means $a^T b = \|a\|\,\|b\|$, the vectors are aligned. Each
  vector is a positive multiple of the other.

• If the angle is $\pi = 180^\circ$, which means $a^T b = -\|a\|\,\|b\|$, the vectors are
  anti-aligned. Each vector is a negative multiple of the other.

• If $\angle(a,b) < \pi/2 = 90^\circ$, the vectors are said to make an acute angle. This is
  the same as $a^T b > 0$, i.e., the vectors have positive inner product.

• If $\angle(a,b) > \pi/2 = 90^\circ$, the vectors are said to make an obtuse angle. This is
  the same as $a^T b < 0$, i.e., the vectors have negative inner product.

These definitions are illustrated in figure 3.6.
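
Since the classification depends only on the sign of the inner product, it can be sketched without computing the angle at all. The function name `classify_angle` and the absolute tolerance `tol` are assumptions made for this example; a tolerance suited to the scale of the data would be needed in practice.

```python
def classify_angle(a, b, tol=1e-12):
    """Classify the angle between nonzero vectors a and b by the sign of a^T b."""
    ip = sum(ai * bi for ai, bi in zip(a, b))
    if abs(ip) <= tol:           # treat a tiny inner product as zero
        return "orthogonal"
    return "acute" if ip > 0 else "obtuse"

print(classify_angle([1, 0], [0, 3]))    # orthogonal
print(classify_angle([1, 2], [2, 1]))    # acute
print(classify_angle([1, 0], [-2, 1]))   # obtuse
```

Aligned and anti-aligned vectors are reported here as acute and obtuse, respectively, since they are the extreme cases of those two classes.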


Examples.

• Spherical distance. Suppose $a$ and $b$ are 3-vectors that represent two points
  that lie on a sphere of radius $R$ (for example, locations on earth). The
  spherical distance between them, measured along the sphere, is given by
  $R\,\angle(a,b)$. This is illustrated in figure 3.7.

• Document similarity via angles. If n-vectors $x$ and $y$ represent the word
  counts for two documents, their angle $\angle(x,y)$ can be used as a measure of
  document dissimilarity. (When using angle to measure document dissimilarity,
  either word counts or histograms can be used; they produce the same
  result.) As an example, table 3.2 gives the angles in degrees between the
  word histograms in the example at the end of §3.2.
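
The spherical distance formula in the first example can be sketched as follows. The function name `spherical_distance` and the two test points (the north pole and a point on the equator of a unit sphere) are assumptions chosen so the answer is easy to check by hand: a quarter of a great circle.

```python
import math

def spherical_distance(a, b, R):
    """Distance along a sphere of radius R between points a and b on the sphere."""
    ip = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    c = max(-1.0, min(1.0, ip / (na * nb)))   # clamp against rounding error
    return R * math.acos(c)                   # R times the angle between a and b

R = 1.0
north_pole = (0.0, 0.0, R)
equator_point = (0.0, R, 0.0)
d = spherical_distance(north_pole, equator_point, R)   # R * pi / 2
```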

Figure 3.6 Top row. Examples of orthogonal, aligned, and anti-aligned vectors.
Bottom row. Vectors that make an obtuse and an acute angle.

Figure 3.7 Two points $a$ and $b$ on a sphere with radius $R$ and center at the
origin. The spherical distance between the points is equal to $R\,\angle(a,b)$.


                   Veterans   Memorial   Academy   Golden Globe   Super
                   Day        Day        Awards    Awards         Bowl
Veterans Day        0         60.6       85.7      87.0           87.7
Memorial Day       60.6        0         85.6      87.5           87.5
Academy A.         85.7       85.6        0        58.7           85.7
Golden Globe A.    87.0       87.5       58.7       0             86.0
Super Bowl         87.7       87.5       86.1      86.0            0

Table 3.2 Pairwise angles (in degrees) between word histograms of five
Wikipedia articles.

Norm of sum via angles. For vectors $x$ and $y$ we have

$$ \|x + y\|^2 = \|x\|^2 + 2x^T y + \|y\|^2 = \|x\|^2 + 2\|x\|\,\|y\|\cos\theta + \|y\|^2, \qquad (3.6) $$

where $\theta = \angle(x,y)$. (The first equality comes from (3.1).) From this we can make
several observations.

• If $x$ and $y$ are aligned ($\theta = 0$), we have $\|x + y\| = \|x\| + \|y\|$. Thus, their
  norms add.

• If $x$ and $y$ are orthogonal ($\theta = 90^\circ$), we have $\|x + y\|^2 = \|x\|^2 + \|y\|^2$. In this
  case the norm-squared values add, and we have $\|x + y\| = \sqrt{\|x\|^2 + \|y\|^2}$.
  This formula is sometimes called the Pythagorean theorem, after the Greek
  mathematician Pythagoras of Samos.
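
A small numerical check of (3.6) and of the orthogonal special case is sketched below. The helper names and the vectors are assumptions for illustration. Note that the cosine is itself computed from the inner product, so this confirms the arithmetic rather than giving independent evidence.

```python
import math

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

def norm_of_sum_via_angle(x, y):
    """Square root of the right-hand side of (3.6); x and y must be nonzero."""
    cos_theta = inner(x, y) / (norm(x) * norm(y))
    return math.sqrt(norm(x) ** 2 + 2 * norm(x) * norm(y) * cos_theta + norm(y) ** 2)

x = [1.0, 2.0, -1.0]
y = [2.0, 0.0, -3.0]
direct = norm([xi + yi for xi, yi in zip(x, y)])   # ||x + y|| = sqrt(29)
via_angle = norm_of_sum_via_angle(x, y)

# Orthogonal case: the squared norms add (Pythagorean theorem).
p = [3.0, 0.0]
q = [0.0, 4.0]
pythagoras = norm([pi + qi for pi, qi in zip(p, q)])   # sqrt(9 + 16) = 5
```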


Correlation coefficient. Suppose $a$ and $b$ are n-vectors, with associated de-meaned
vectors

$$ \tilde{a} = a - \operatorname{avg}(a)\mathbf{1}, \qquad \tilde{b} = b - \operatorname{avg}(b)\mathbf{1}. $$

Assuming these de-meaned vectors are not zero (which occurs when the original
vectors have all equal entries), we define their correlation coefficient as

$$ \rho = \frac{\tilde{a}^T \tilde{b}}{\|\tilde{a}\|\,\|\tilde{b}\|}. \qquad (3.7) $$

Thus, $\rho = \cos\theta$, where $\theta = \angle(\tilde{a}, \tilde{b})$. We can also express the correlation coefficient in
terms of the vectors $u$ and $v$ obtained by standardizing $a$ and $b$. With $u = \tilde{a}/\operatorname{std}(a)$
and $v = \tilde{b}/\operatorname{std}(b)$, we have

$$ \rho = u^T v / n. \qquad (3.8) $$

(We use $\|u\| = \|v\| = \sqrt{n}$.)

This is a symmetric function of the vectors: The correlation coefficient between
$a$ and $b$ is the same as the correlation coefficient between $b$ and $a$. The
Cauchy-Schwarz inequality tells us that the correlation coefficient ranges between $-1$ and
$+1$. For this reason, the correlation coefficient is sometimes expressed as a
percentage. For example, $\rho = 30\%$ means $\rho = 0.3$. When $\rho = 0$, we say the vectors
are uncorrelated. (By convention, we say that a vector with all entries equal is
uncorrelated with any vector.)
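
The two expressions (3.7) and (3.8) can be compared numerically. The sketch below uses made-up data, and the function names are assumptions for illustration; both functions require that neither vector has all entries equal.

```python
import math

def avg(x):
    return sum(x) / len(x)

def demean(x):
    m = avg(x)
    return [v - m for v in x]

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

def std(x):
    return norm(demean(x)) / math.sqrt(len(x))

def corr_coef(a, b):
    """Correlation coefficient via (3.7): cosine of the angle between de-meaned vectors."""
    at, bt = demean(a), demean(b)
    return inner(at, bt) / (norm(at) * norm(bt))

def corr_coef_standardized(a, b):
    """Correlation coefficient via (3.8): u^T v / n with u, v the standardized vectors."""
    sa, sb = std(a), std(b)
    u = [e / sa for e in demean(a)]
    v = [e / sb for e in demean(b)]
    return inner(u, v) / len(a)

a = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
b = [2.0, 1.0, 4.0, 3.0, 6.0]
rho_1 = corr_coef(a, b)                 # 10 / sqrt(148), about 0.82
rho_2 = corr_coef_standardized(a, b)    # same value, up to rounding
```

A positive scaling and offset of `a` gives correlation 1 with `a`, and negating `a` gives correlation -1, which matches the extreme cases described in the text.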

The correlation coefficient tells us how the entries in the two vectors vary
together. High correlation (say, $\rho = 0.8$) means that entries of $a$ and $b$ are typically
above their mean for many of the same entries. The extreme case $\rho = 1$ occurs
only if the vectors $\tilde{a}$ and $\tilde{b}$ are aligned, which means that each is a positive
multiple of the other, and the other extreme case $\rho = -1$ occurs only when $\tilde{a}$ and $\tilde{b}$
are negative multiples of each other. This idea is illustrated in figure 3.8, which
shows the entries of two vectors, as well as a scatter plot of them, for cases with
correlation near 1, near $-1$, and near 0.

The correlation coefficient is often used when the vectors represent time series,
such as the returns on two investments over some time interval, or the rainfall in
two locations over some time interval. If they are highly correlated (say, $\rho > 0.8$),
the two time series are typically above their mean values at the same times. For
example, we would expect the rainfall time series at two nearby locations to be
highly correlated. As another example, we might expect the returns of two similar
companies, in the same business area, to be highly correlated.

Figure 3.8 Three pairs of vectors $a$, $b$ of length 10, with correlation coefficients
0.968 (top), $-0.988$ (middle), and 0.004 (bottom).


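Standard deviation of sum. We can derive a formula for the standard deviation
of a sum from (3.6):

$$ \operatorname{std}(a + b) = \sqrt{\operatorname{std}(a)^2 + 2\rho\,\operatorname{std}(a)\operatorname{std}(b) + \operatorname{std}(b)^2}. \qquad (3.9) $$

To derive this from (3.6) we let $\tilde{a}$ and $\tilde{b}$ denote the de-meaned versions of $a$ and $b$.
Then $\tilde{a} + \tilde{b}$ is the de-meaned version of $a + b$, and $\operatorname{std}(a + b)^2 = \|\tilde{a} + \tilde{b}\|^2 / n$. Now
using (3.6) and $\rho = \cos\angle(\tilde{a}, \tilde{b})$, we get

$$
\begin{aligned}
n \operatorname{std}(a + b)^2 &= \|\tilde{a} + \tilde{b}\|^2 \\
  &= \|\tilde{a}\|^2 + 2\rho\,\|\tilde{a}\|\,\|\tilde{b}\| + \|\tilde{b}\|^2 \\
  &= n \operatorname{std}(a)^2 + 2\rho\, n \operatorname{std}(a)\operatorname{std}(b) + n \operatorname{std}(b)^2.
\end{aligned}
$$

Dividing by $n$ and taking the squareroot yields the formula above.

If $\rho = 1$, the standard deviation of the sum of vectors is the sum of their
standard deviations, i.e.,

$$ \operatorname{std}(a + b) = \operatorname{std}(a) + \operatorname{std}(b). $$

As $\rho$ decreases, the standard deviation of the sum decreases. When $\rho = 0$, i.e., $a$
and $b$ are uncorrelated, the standard deviation of the sum $a + b$ is

$$ \operatorname{std}(a + b) = \sqrt{\operatorname{std}(a)^2 + \operatorname{std}(b)^2}, $$

which is smaller than $\operatorname{std}(a) + \operatorname{std}(b)$ (unless one of them is zero). When $\rho = -1$,
the standard deviation of the sum is as small as it can be,

$$ \operatorname{std}(a + b) = |\operatorname{std}(a) - \operatorname{std}(b)|. $$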
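
Formula (3.9) can be checked against a direct computation. The sketch below uses made-up vectors; the helper names are assumptions for illustration.

```python
import math

def demean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def std(x):
    xt = demean(x)
    return math.sqrt(inner(xt, xt) / len(x))

def corr_coef(a, b):
    at, bt = demean(a), demean(b)
    return inner(at, bt) / math.sqrt(inner(at, at) * inner(bt, bt))

a = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
b = [2.0, 1.0, 4.0, 3.0, 6.0]

# Direct computation: standard deviation of the entrywise sum.
direct = std([ai + bi for ai, bi in zip(a, b)])

# Formula (3.9): uses only std(a), std(b), and the correlation coefficient.
rho = corr_coef(a, b)
via_formula = math.sqrt(std(a) ** 2 + 2 * rho * std(a) * std(b) + std(b) ** 2)
```

For these vectors both values equal $\sqrt{8.96}$, roughly 2.99, which is less than `std(a) + std(b)` because the correlation is below 1.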


Hedging investments. Suppose that vectors $a$ and $b$ are time series of returns
for two assets with the same return (average) $\mu$ and risk (standard deviation) $\sigma$,
and correlation coefficient $\rho$. (These are the traditional symbols used.) The vector
$c = (a + b)/2$ is the time series of returns for an investment with 50% in each of the
assets. This blended investment has the same return as the original assets, since

$$ \operatorname{avg}(c) = \operatorname{avg}((a + b)/2) = (\operatorname{avg}(a) + \operatorname{avg}(b))/2 = \mu. $$

The risk (standard deviation) of this blended investment is

$$ \operatorname{std}(c) = \sqrt{2\sigma^2 + 2\rho\sigma^2}\,/\,2 = \sigma\sqrt{(1 + \rho)/2}, $$

using (3.9). From this we see that the risk of the blended investment is never
more than the risk of the original assets, and is smaller when the correlation of
the original asset returns is smaller. When the returns are uncorrelated, the risk
is a factor $1/\sqrt{2} = 0.707$ smaller than the risk of the original assets. If the asset
returns are strongly negatively correlated (i.e., $\rho$ is near $-1$), the risk of the blended
investment is much smaller than the risk of the original assets. Investing in two
assets with uncorrelated, or negatively correlated, returns is called hedging (which
is short for 'hedging your bets'). Hedging reduces risk.
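
The risk formula for the blended investment can be illustrated with a toy example. Everything below is an assumption made for the sketch: the values of `mu` and `sigma`, the function name `blended_risk`, and the two return series, which are constructed so that they have the same mean, the same standard deviation, and zero correlation.

```python
import math

def avg(x):
    return sum(x) / len(x)

def std(x):
    m = avg(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def blended_risk(sigma, rho):
    """Risk of a 50/50 blend of two assets with equal risk sigma and correlation rho."""
    return sigma * math.sqrt((1 + rho) / 2)

mu, sigma = 0.01, 0.02   # made-up per-period return and risk
# Deviation patterns (1,-1,1,-1) and (1,1,-1,-1) are orthogonal, so rho = 0.
a = [mu + sigma * d for d in (1.0, -1.0, 1.0, -1.0)]
b = [mu + sigma * d for d in (1.0, 1.0, -1.0, -1.0)]
c = [(ai + bi) / 2 for ai, bi in zip(a, b)]   # 50% in each asset

risk_c = std(c)                      # computed from the blended series
predicted = blended_risk(sigma, 0.0) # sigma / sqrt(2), about 0.707 * sigma
```

The limiting cases follow directly from the formula: `blended_risk(sigma, 1.0)` equals `sigma` (no reduction), and `blended_risk(sigma, -1.0)` is zero.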

Units for heterogeneous vector entries. When the entries of vectors represent
different types of quantities, the choice of units used to represent each entry affects 
the angle, standard deviation, and correlation between a pair of vectors. The 
discussion on page 51, about how the choice of units can affect distances between 
pairs of vectors, therefore applies to these quantities as well. The general rule 
of thumb is to choose units for different entries so the typical vector entries have 
similar sizes or ranges of values. 


Complexity 


Computing the norm of an n-vector requires n multiplications (to square each
entry), n - 1 additions (to add the squares), and one squareroot. Even though
computing the squareroot typically takes more time than computing the product
or sum of two numbers, it is counted as just one flop. So computing the norm takes
2n flops. The cost of computing the RMS value of an n-vector is the same, since
we can ignore the two flops involved in division by $\sqrt{n}$. Computing the distance
between two vectors costs 3n flops, and computing the angle between them costs
6n flops. All of these operations have order n.

De-meaning an n-vector requires 2n flops (n for forming the average and
another n flops for subtracting the average from each entry). The standard deviation
is the RMS value of the de-meaned vector, and this calculation takes 4n flops (2n
for computing the de-meaned vector and 2n for computing its RMS value).
Equation (3.5) suggests a slightly more efficient method with a complexity of 3n flops:
first compute the average (n flops) and RMS value (2n flops), and then find the
standard deviation as $\operatorname{std}(x) = \left(\operatorname{rms}(x)^2 - \operatorname{avg}(x)^2\right)^{1/2}$. Standardizing an n-vector
costs 5n flops. The correlation coefficient between two vectors costs 10n flops to
compute. These operations also have order n.
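
The two ways of computing the standard deviation described above can be sketched side by side. The function names are assumptions for illustration, and the flop counts in the comments are the approximate counts from the text, not measurements. The `max(..., 0.0)` guard is an addition made here: in floating point the difference of squares can come out slightly negative, and the second method can lose accuracy when the mean is large compared with the standard deviation.

```python
import math

def std_two_pass(x):
    """About 4n flops: de-mean (2n), then RMS value of the de-meaned vector (2n)."""
    n = len(x)
    mean = sum(x) / n
    return math.sqrt(sum((v - mean) ** 2 for v in x) / n)

def std_via_rms(x):
    """About 3n flops: average (n) and mean square (2n), then sqrt(rms^2 - avg^2)."""
    n = len(x)
    mean = sum(x) / n
    mean_sq = sum(v * v for v in x) / n
    return math.sqrt(max(mean_sq - mean * mean, 0.0))   # guard against tiny negatives

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
s1 = std_two_pass(x)   # 2.0
s2 = std_via_rms(x)    # 2.0
```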

As a slightly more involved computation, suppose that we wish to determine the
nearest neighbor among a collection of k n-vectors $z_1, \ldots, z_k$ to another n-vector $x$.
(This will come up in the next chapter.) The simple approach is to compute the
distances $\|x - z_i\|$ for $i = 1, \ldots, k$, and then find the minimum of these. (Sometimes
a comparison of two numbers is also counted as a flop.) The cost of this is 3kn flops
to compute the distances, and k - 1 comparisons to find the minimum. The latter
term can be ignored, so the flop count is 3kn. The order of finding the nearest
neighbor in a collection of k n-vectors is kn.
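
The simple nearest neighbor search described above might look as follows in Python. The function name `nearest_neighbor` and the data are assumptions for illustration. As a small design choice, the sketch compares squared distances and takes a single squareroot at the end; this does not change which vector is nearest, since the squareroot is increasing.

```python
import math

def nearest_neighbor(x, zs):
    """Return (index, distance) of the vector in the list zs that is closest to x."""
    best_i, best_d2 = -1, math.inf
    for i, z in enumerate(zs):
        # About 3n flops per candidate: n subtractions, n squarings, n - 1 additions.
        d2 = sum((xj - zj) ** 2 for xj, zj in zip(x, z))
        if d2 < best_d2:             # one comparison per candidate
            best_i, best_d2 = i, d2
    return best_i, math.sqrt(best_d2)

x = [5.0, 6.0]                       # made-up query vector
zs = [[2.0, 1.0], [7.0, 2.0], [5.5, 4.0], [4.0, 8.0], [1.0, 5.0], [9.0, 6.0]]
idx, dist = nearest_neighbor(x, zs)  # idx is 2 (zero-based), dist is sqrt(4.25)
```

The loop visits each of the k candidates once and does order n work for each, which matches the order kn stated in the text.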

Exercises

3.1 Distance between Boolean vectors. Suppose that $x$ and $y$ are Boolean n-vectors, which
means that each of their entries is either 0 or 1. What is their distance $\|x - y\|$?

3.2 RMS value and average of block vectors. Let $x$ be a block vector with two vector elements,
$x = (a, b)$, where $a$ and $b$ are vectors of size $n$ and $m$, respectively.

(a) Express $\operatorname{rms}(x)$ in terms of $\operatorname{rms}(a)$, $\operatorname{rms}(b)$, $m$, and $n$.

(b) Express $\operatorname{avg}(x)$ in terms of $\operatorname{avg}(a)$, $\operatorname{avg}(b)$, $m$, and $n$.

3.3 Reverse triangle inequality. Suppose $a$ and $b$ are vectors of the same size. The triangle
inequality states that $\|a + b\| \le \|a\| + \|b\|$. Show that we also have $\|a + b\| \ge \|a\| - \|b\|$.
Hints. Draw a picture to get the idea. To show the inequality, apply the triangle inequality
to $(a + b) + (-b)$.

3.4 Norm identities. Verify that the following identities hold for any two vectors $a$ and $b$ of
the same size.

(a) $(a + b)^T (a - b) = \|a\|^2 - \|b\|^2$.

(b) $\|a + b\|^2 + \|a - b\|^2 = 2(\|a\|^2 + \|b\|^2)$. This is called the parallelogram law.

3.5 General norms. Any real-valued function $f$ that satisfies the four properties given on
page 46 (nonnegative homogeneity, triangle inequality, nonnegativity, and definiteness) is
called a vector norm, and is usually written as $f(x) = \|x\|_{\mathrm{mn}}$, where the subscript is some
kind of identifier or mnemonic to identify it. The most commonly used norm is the one we
use in this book, the Euclidean norm, which is sometimes written with the subscript 2,
as $\|x\|_2$. Two other common vector norms for n-vectors are the 1-norm $\|x\|_1$ and the
$\infty$-norm $\|x\|_\infty$, defined as

$$ \|x\|_1 = |x_1| + \cdots + |x_n|, \qquad \|x\|_\infty = \max\{|x_1|, \ldots, |x_n|\}. $$

These norms are the sum and the maximum of the absolute values of the entries in the
vector, respectively. The 1-norm and the $\infty$-norm arise in some recent and advanced
applications, but we will not encounter them in this book.

Verify that the 1-norm and the $\infty$-norm satisfy the four norm properties listed on page 46.

3.6 Taylor approximation of norm. Find a general formula for the Taylor approximation of
the function $f(x) = \|x\|$ near a given nonzero vector $z$. You can express the approximation
in the form $\hat{f}(x) = a^T (x - z) + b$.

3.7 Chebyshev inequality. Suppose $x$ is a 100-vector with $\operatorname{rms}(x) = 1$. What is the maximum
number of entries of $x$ that can satisfy $|x_i| \ge 3$? If your answer is $k$, explain why no such
vector can have $k + 1$ entries with absolute values at least 3, and give an example of a
specific 100-vector that has RMS value 1, with $k$ of its entries larger than 3 in absolute
value.

3.8 Converse Chebyshev inequality. Show that at least one entry of a vector has absolute
value at least as large as the RMS value of the vector.

3.9 Difference of squared distances. Determine whether the difference of the squared distances
to two fixed vectors $c$ and $d$, defined as

$$ f(x) = \|x - c\|^2 - \|x - d\|^2, $$

is linear, affine, or neither. If it is linear, give its inner product representation, i.e.,
an n-vector $a$ for which $f(x) = a^T x$ for all $x$. If it is affine, give $a$ and $b$ for which
$f(x) = a^T x + b$ holds for all $x$. If it is neither linear nor affine, give specific $x$, $y$, $\alpha$, and
$\beta$ for which superposition fails, i.e.,

$$ f(\alpha x + \beta y) \neq \alpha f(x) + \beta f(y). $$

(Provided $\alpha + \beta = 1$, this shows the function is neither linear nor affine.)

3.10 Nearest neighbor document. Consider the 5 Wikipedia pages in table 3.1 on page 51.
What is the nearest neighbor of (the word count histogram vector of) 'Veterans Day'
among the others? Does the answer make sense?

3.11 Neighboring electronic health records. Let $x_1, \ldots, x_N$ be n-vectors that contain n features
extracted from a set of N electronic health records (EHRs), for a population of N patients.
(The features might involve patient attributes and current and past symptoms, diagnoses,
test results, hospitalizations, procedures, and medications.) Briefly describe in words a
practical use for identifying the 10 nearest neighbors of a given EHR (as measured by
their associated feature vectors), among the other EHRs.

3.12 Nearest point to a line. Let $a$ and $b$ be different n-vectors. The line passing through $a$
and $b$ is given by the set of vectors of the form $(1 - \theta)a + \theta b$, where $\theta$ is a scalar that
determines the particular point on the line. (See page 18.)

Let $x$ be any n-vector. Find a formula for the point $p$ on the line that is closest to $x$.
The point $p$ is called the projection of $x$ onto the line. Show that $(p - x) \perp (a - b)$, and
draw a simple picture illustrating this in 2-D. Hint. Work with the square of the distance
between a point on the line and $x$, i.e., $\|(1 - \theta)a + \theta b - x\|^2$. Expand this, and minimize
over $\theta$.

3.13 Nearest nonnegative vector. Let $x$ be an n-vector and define $y$ as the nonnegative vector
(i.e., the vector with nonnegative entries) closest to $x$. Give an expression for the elements
of $y$. Show that the vector $z = y - x$ is also nonnegative and that $z^T y = 0$.

3.14 Nearest unit vector. What is the nearest neighbor of the n-vector $x$ among the unit vectors
$e_1, \ldots, e_n$?

3.15 Average, RMS value, and standard deviation. Use the formula (3.5) to show that for any
vector $x$, the following two inequalities hold:

$$ |\operatorname{avg}(x)| \le \operatorname{rms}(x), \qquad \operatorname{std}(x) \le \operatorname{rms}(x). $$

Is it possible to have equality in these inequalities? If $|\operatorname{avg}(x)| = \operatorname{rms}(x)$ is possible, give
the conditions on $x$ under which it holds. Repeat for $\operatorname{std}(x) = \operatorname{rms}(x)$.

3.16 Effect of scaling and offset on average and standard deviation. Suppose $x$ is an n-vector
and $\alpha$ and $\beta$ are scalars.

(a) Show that $\operatorname{avg}(\alpha x + \beta\mathbf{1}) = \alpha \operatorname{avg}(x) + \beta$.

(b) Show that $\operatorname{std}(\alpha x + \beta\mathbf{1}) = |\alpha| \operatorname{std}(x)$.

3.17 Average and standard deviation of linear combination. Let $x_1, \ldots, x_k$ be n-vectors, and
$\alpha_1, \ldots, \alpha_k$ be numbers, and consider the linear combination $z = \alpha_1 x_1 + \cdots + \alpha_k x_k$.

(a) Show that $\operatorname{avg}(z) = \alpha_1 \operatorname{avg}(x_1) + \cdots + \alpha_k \operatorname{avg}(x_k)$.

(b) Now suppose the vectors are uncorrelated, which means that for $i \neq j$, $x_i$ and $x_j$
are uncorrelated. Show that $\operatorname{std}(z) = \sqrt{\alpha_1^2 \operatorname{std}(x_1)^2 + \cdots + \alpha_k^2 \operatorname{std}(x_k)^2}$.

3.18 Triangle equality. When does the triangle inequality hold with equality, i.e., what are the
conditions on $a$ and $b$ to have $\|a + b\| = \|a\| + \|b\|$?

3.19 Norm of sum. Use the formulas (3.1) and (3.6) to show the following:

(a) $a \perp b$ if and only if $\|a + b\| = \sqrt{\|a\|^2 + \|b\|^2}$.

(b) Nonzero vectors $a$ and $b$ make an acute angle if and only if $\|a + b\| > \sqrt{\|a\|^2 + \|b\|^2}$.

(c) Nonzero vectors $a$ and $b$ make an obtuse angle if and only if $\|a + b\| < \sqrt{\|a\|^2 + \|b\|^2}$.

Draw a picture illustrating each case in 2-D.

3.20 Regression model sensitivity. Consider the regression model $\hat{y} = x^T \beta + v$, where $\hat{y}$ is the
prediction, $x$ is a feature vector, $\beta$ is a coefficient vector, and $v$ is the offset term. If $x$ and $\tilde{x}$
are feature vectors with corresponding predictions $\hat{y}$ and $\tilde{y}$, show that $|\hat{y} - \tilde{y}| \le \|\beta\|\,\|x - \tilde{x}\|$.
This means that when $\|\beta\|$ is small, the prediction is not very sensitive to a change in the
feature vector.

3.21 Dirichlet energy of a signal. Suppose the T-vector $x$ represents a time series or signal.
The quantity

$$ D(x) = (x_1 - x_2)^2 + \cdots + (x_{T-1} - x_T)^2, $$

the sum of the squared differences of adjacent values of the signal, is called the Dirichlet energy of
the signal, named after the mathematician Peter Gustav Lejeune Dirichlet. The Dirichlet
energy is a measure of the roughness or wiggliness of the time series. It is sometimes
divided by $T - 1$, to give the mean square difference of adjacent values.

(a) Express $D(x)$ in vector notation. (You can use vector slicing, vector addition or
subtraction, inner product, norm, and angle.)

(b) How small can $D(x)$ be? What signals $x$ have this minimum value of the Dirichlet
energy?

(c) Find a signal $x$ with entries no more than one in absolute value that has the largest
possible value of $D(x)$. Give the value of the Dirichlet energy achieved.

3.22 Distance from Palo Alto to Beijing. The surface of the earth is reasonably approximated
as a sphere with radius $R = 6367.5$ km. A location on the earth's surface is traditionally
given by its latitude $\theta$ and its longitude $\lambda$, which correspond to angular distance from the
equator and prime meridian, respectively. The 3-D coordinates of the location are given
by

$$ \begin{bmatrix} R \sin\lambda \cos\theta \\ R \cos\lambda \cos\theta \\ R \sin\theta \end{bmatrix}. $$

(In this coordinate system $(0, 0, 0)$ is the center of the earth, $R(0, 0, 1)$ is the North pole,
and $R(0, 1, 0)$ is the point on the equator on the prime meridian, due south of the Royal
Observatory outside London.)

The distance through the earth between two locations (3-vectors) $a$ and $b$ is $\|a - b\|$. The
distance along the surface of the earth between points $a$ and $b$ is $R\,\angle(a,b)$. Find these two
distances between Palo Alto and Beijing, with latitudes and longitudes given below.

City         Latitude θ    Longitude λ
Beijing      39.914°       116.392°
Palo Alto    37.429°       -122.138°

3.23 Angle between two nonnegative vectors. Let $x$ and $y$ be two nonzero n-vectors with
nonnegative entries, i.e., each $x_i \ge 0$ and each $y_i \ge 0$. Show that the angle between $x$
and $y$ lies between 0 and $90^\circ$. Draw a picture for the case when $n = 2$, and give a short
geometric explanation. When are $x$ and $y$ orthogonal?

3.24 Distance versus angle nearest neighbor. Suppose $z_1, \ldots, z_m$ is a collection of n-vectors,
and $x$ is another n-vector. The vector $z_j$ is the (distance) nearest neighbor of $x$ (among
the given vectors) if

$$ \|x - z_j\| \le \|x - z_i\|, \quad i = 1, \ldots, m, $$

i.e., $x$ has smallest distance to $z_j$. We say that $z_j$ is the angle nearest neighbor of $x$ if

$$ \angle(x, z_j) \le \angle(x, z_i), \quad i = 1, \ldots, m, $$

i.e., $x$ has smallest angle to $z_j$.

(a) Give a simple specific numerical example where the (distance) nearest neighbor is
not the same as the angle nearest neighbor.

(b) Now suppose that the vectors $z_1, \ldots, z_m$ are normalized, which means that $\|z_i\| = 1$,
$i = 1, \ldots, m$. Show that in this case the distance nearest neighbor and the angle
nearest neighbor are always the same. Hint. You can use the fact that arccos
is a decreasing function, i.e., for any $u$ and $v$ with $-1 \le u < v \le 1$, we have
$\arccos(u) > \arccos(v)$.

3.25 Leveraging. Consider an asset with return time series over T periods given by the
T-vector $r$. This asset has mean return $\mu$ and risk $\sigma$, which we assume is positive. We also
consider cash as an asset, with return vector $\mu^{\mathrm{rf}}\mathbf{1}$, where $\mu^{\mathrm{rf}}$ is the cash interest rate per
period. Thus, we model cash as an asset with return $\mu^{\mathrm{rf}}$ and zero risk. (The superscript
in $\mu^{\mathrm{rf}}$ stands for 'risk-free'.) We will create a simple portfolio consisting of the asset and
cash. If we invest a fraction $\theta$ in the asset, and $1 - \theta$ in cash, our portfolio return is given
by the time series

$$ p = \theta r + (1 - \theta)\mu^{\mathrm{rf}}\mathbf{1}. $$

We interpret $\theta$ as the fraction of our portfolio we hold in the asset. We allow the choices
$\theta > 1$, or $\theta < 0$. In the first case we are borrowing cash and using the proceeds to buy
more of the asset, which is called leveraging. In the second case we are shorting the asset.
When $\theta$ is between 0 and 1 we are blending our investment in the asset and cash, which
is a form of hedging.

(a) Derive a formula for the return and risk of the portfolio, i.e., the mean and standard
deviation of $p$. These should be expressed in terms of $\mu$, $\sigma$, $\mu^{\mathrm{rf}}$, and $\theta$. Check your
formulas for the special cases $\theta = 0$ and $\theta = 1$.

(b) Explain how to choose $\theta$ so the portfolio has a given target risk level $\sigma^{\mathrm{tar}}$ (which is
positive). If there are multiple values of $\theta$ that give the target risk, choose the one
that results in the highest portfolio return.

(c) Assume we choose the value of $\theta$ as in part (b). When do we use leverage? When
do we short the asset? When do we hedge? Your answers should be in English.

3.26 Time series auto-correlation. Suppose the T-vector $x$ is a non-constant time series, with
$x_t$ the value at time (or period) $t$. Let $\mu = (\mathbf{1}^T x)/T$ denote its mean value. The
auto-correlation of $x$ is the function $R(\tau)$, defined for $\tau = 0, 1, \ldots$ as the correlation coefficient
of the two vectors $(x, \mu\mathbf{1}_\tau)$ and $(\mu\mathbf{1}_\tau, x)$. (The subscript $\tau$ denotes the length of the ones
vector.) Both of these vectors also have mean $\mu$. Roughly speaking, $R(\tau)$ tells us how
correlated the time series is with a version of itself lagged or shifted by $\tau$ periods. (The
argument $\tau$ is called the lag.)

(a) Explain why $R(0) = 1$, and $R(\tau) = 0$ for $\tau \ge T$.

(b) Let $z$ denote the standardized or z-scored version of $x$ (see page 56). Show that for
$\tau = 0, \ldots, T - 1$,

$$ R(\tau) = \frac{1}{T} \sum_{t=1}^{T-\tau} z_t z_{t+\tau}. $$

(c) Find the auto-correlation for the time series $x = (+1, -1, +1, -1, \ldots, +1, -1)$. You
can assume that $T$ is even.

(d) Suppose $x_t$ denotes the number of meals served by a restaurant on day $t$. It is
observed that $R(7)$ is fairly high, and $R(14)$ is also high, but not as high. Give an
English explanation of why this might be.

3.27 Another measure of the spread of the entries of a vector. The standard deviation is
a measure of how much the entries of a vector differ from their mean value. Another
measure of how much the entries of an n-vector $x$ differ from each other, called the mean
square difference, is defined as

$$ \operatorname{MSD}(x) = \frac{1}{n^2} \sum_{i,j=1}^{n} (x_i - x_j)^2. $$

(The sum means that you should add up the $n^2$ terms, as the indices $i$ and $j$ each range
from 1 to $n$.) Show that $\operatorname{MSD}(x) = 2\operatorname{std}(x)^2$. Hint. First observe that $\operatorname{MSD}(\tilde{x}) =
\operatorname{MSD}(x)$, where $\tilde{x} = x - \operatorname{avg}(x)\mathbf{1}$ is the de-meaned vector. Expand the sum and recall
that $\sum_i \tilde{x}_i = 0$.

3.28 Weighted norm. On page 51 we discuss the importance of choosing the units or scaling for
the individual entries of vectors, when they represent heterogeneous quantities. Another
approach is to use a weighted norm of a vector $x$, defined as

$$ \|x\|_w = \sqrt{w_1 x_1^2 + \cdots + w_n x_n^2}, $$

where $w_1, \ldots, w_n$ are given positive weights, used to assign more or less importance to the
different elements of the n-vector $x$. If all the weights are one, the weighted norm reduces
to the usual ('unweighted') norm. It can be shown that the weighted norm is a general
norm, i.e., it satisfies the four norm properties listed on page 46. Following the discussion
on page 51, one common rule of thumb is to choose the weight $w_i$ as the inverse of the
typical value of $x_i^2$ in the application.

A version of the Cauchy-Schwarz inequality holds for weighted norms: For any n-vectors
$x$ and $y$, we have

$$ |w_1 x_1 y_1 + \cdots + w_n x_n y_n| \le \|x\|_w \|y\|_w. $$

(The expression inside the absolute value on the left-hand side is sometimes called the
weighted inner product of $x$ and $y$.) Show that this inequality holds. Hint. Consider the
vectors $\tilde{x} = (x_1\sqrt{w_1}, \ldots, x_n\sqrt{w_n})$ and $\tilde{y} = (y_1\sqrt{w_1}, \ldots, y_n\sqrt{w_n})$, and use the (standard)
Cauchy-Schwarz inequality.


Chapter 4


Clustering


In this chapter we consider the task of clustering a collection of vectors into groups
or clusters of vectors that are close to each other, as measured by the distance
between pairs of them. We describe a famous clustering method, called the
k-means algorithm, and give some typical applications.

The material in this chapter will not be used in the sequel. But the ideas, and
the k-means algorithm in particular, are widely used in practical applications, and
rely only on the ideas developed in the previous three chapters. So this chapter
can be considered an interlude that covers useful material that builds on the ideas
developed so far.


4.1 Clustering


Suppose we have N n-vectors, $x_1, \ldots, x_N$. The goal of clustering is to group or
partition the vectors (if possible) into k groups or clusters, with the vectors in each 
group close to each other. Clustering is very widely used in many application areas, 
typically (but not always) when the vectors represent features of objects. 

Normally we have k much smaller than N, i.e., there are many more vectors 
than groups. Typical applications use values of k that range from a handful to a 
few hundred or more, with values of N that range from hundreds to billions. Part 
of the task of clustering a collection of vectors is to determine whether or not the 
vectors can be divided into k groups, with vectors in each group near each other. 
Of course this depends on k, the number of clusters, and the particular data, i.e., 
the vectors $x_1, \ldots, x_N$.

Figure 4.1 shows a simple example, with N = 300 2-vectors, shown as small 
circles. We can easily see that this collection of vectors can be divided into k = 3 
clusters, shown on the right with the colors representing the different clusters. We 
could partition these data into other numbers of clusters, but we can see that k = 3 
is a good value. 

This example is not typical in several ways. First, the vectors have dimension
n = 2. Clustering any set of 2-vectors is easy: We simply scatter plot the values
and check visually if the data are clustered, and if so, how many clusters there are.
In almost all applications n is larger than 2 (and typically, much larger than 2),
in which case this simple visual method cannot be used. The second way in which
it is not typical is that the points are very well clustered. In most applications,
the data are not as cleanly clustered as in this simple example; there are several or
even many points that lie in between clusters. Finally, in this example, it is clear
that the best choice of k is k = 3. In real examples, it can be less clear what the
best value of k is. But even when the clustering is not as clean as in this example,
and the best value of k is not clear, clustering can be very useful in practice.

Figure 4.1 300 points in a plane. The points can be clustered in the three
groups shown on the right.


Examples. Before we delve more deeply into the details of clustering and cluster- 
ing algorithms, we list some common applications where clustering is used. 


e Topic discovery. Suppose x; are word histograms associated with N docu- 
ments. A clustering algorithm partitions the documents into k groups, which 
typically can be interpreted as groups of documents with the same or similar 
topics, genre, or author. Since the clustering algorithm runs automatically 
and without any understanding of what the words in the dictionary mean, 
this is sometimes called automatic topic discovery. 


e Patient clustering. If x; are feature vectors associated with N patients ad- 
mitted to a hospital, a clustering algorithm clusters the patients into k groups 
of similar patients (at least in terms of their feature vectors). 


• Customer market segmentation. Suppose the vector $x_i$ gives the quantities (or
dollar values) of n items purchased by customer i over some period of time. A 
clustering algorithm will group the customers into k market segments, which 
are groups of customers with similar purchasing patterns. 


• ZIP code clustering. Suppose that $x_i$ is a vector giving n quantities or statistics
for the residents of ZIP code i, such as numbers of residents in various
age groups, household size, education statistics, and income statistics. (In 
this example N is around 40000.) A clustering algorithm might be used to 
cluster the 40000 ZIP codes into, say, k = 100 groups of ZIP codes with 
similar statistics. 


• Student clustering. Suppose the vector $x_i$ gives the detailed grading record
of student i in a course, i.e., her grades on each question in the quizzes, 
homework assignments, and exams. A clustering algorithm might be used to 
cluster the students into k = 10 groups of students who performed similarly. 


• Survey response clustering. A group of N people respond to a survey with n
questions. Each question contains a statement, such as ‘The movie was too 
long’, followed by some ordered options such as 


Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree. 


(This is called a Likert scale, named after the psychologist Rensis Likert.) 
Suppose the n-vector $x_i$ encodes the selections of respondent i on the n
questions, using the numerical coding −2, −1, 0, +1, +2 for the responses
above. A clustering algorithm can be used to cluster the respondents into k 
groups, each with similar responses to the survey. 


• Weather zones. For each of N counties we have a 24-vector $x_i$ that gives the
average monthly temperature in the first 12 entries and the average monthly
rainfall in the last 12 entries. (We can standardize all the temperatures, and
all the rainfall data, so they have a typical range between −1 and +1.) The
vector $x_i$ summarizes the annual weather pattern in county i. A clustering
algorithm can be used to cluster the counties into k groups that have similar 
weather patterns, called weather zones. This clustering can be shown on a 
map, and used to recommend landscape plantings depending on zone. 


• Daily energy use patterns. The 24-vectors $x_i$ give the average (electric) energy
use for N customers over some period (say, a month) for each hour of
the day. A clustering algorithm partitions customers into groups, each with 
similar patterns of daily energy consumption. We might expect a clustering 
algorithm to ‘discover’ which customers have a swimming pool, an electric 
water heater, or solar panels. 


• Financial sectors. For each of N companies we have an n-vector whose components
are financial and business attributes such as total capitalization,
quarterly returns and risks, trading volume, profit and loss, or dividends 
paid. (These quantities would typically be scaled so as to have similar ranges 
of values.) A clustering algorithm would group companies into sectors, i.e., 
groups of companies with similar attributes. 


In each of these examples, it would be quite informative to know that the vectors
can be well clustered into, say, k = 5 or k = 37 groups. This can be used to develop
insight into the data. By examining the clusters we can often understand them, 
and assign labels or descriptions to them. 


4.2 A clustering objective


In this section we formalize the idea of clustering, and introduce a natural measure 
of the quality of a given clustering. 


Specifying the cluster assignments. We specify a clustering of the vectors by 
saying which cluster or group each vector belongs to. We label the groups 1,...,k, 
and specify a clustering or assignment of the N given vectors to groups using an 
N-vector c, where $c_i$ is the group (number) that the vector $x_i$ is assigned to. As
a simple example with N = 5 vectors and k = 3 groups, c = (3,1,1,1,2) means
that $x_1$ is assigned to group 3, $x_2$, $x_3$, and $x_4$ are assigned to group 1, and $x_5$ is
assigned to group 2. We will also describe the clustering by the sets of indices for
each group. We let $G_j$ be the set of indices corresponding to group j. For our
simple example above, we have 


$$G_1 = \{2, 3, 4\}, \qquad G_2 = \{5\}, \qquad G_3 = \{1\}.$$


(Here we are using the notation of sets; see appendix A.) Formally, we can express 
these index sets in terms of the group assignment vector c as 


$$G_j = \{\, i \mid c_i = j \,\},$$


which means that $G_j$ is the set of all indices i for which $c_i = j$.
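
The correspondence between the assignment vector c and the index sets can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original text; the helper name `index_sets` is a hypothetical choice, and it reproduces the example with c = (3,1,1,1,2).

```python
def index_sets(c, k):
    """Return a dict G with G[j] = set of 1-based indices i for which c_i = j."""
    G = {j: set() for j in range(1, k + 1)}
    for i, group in enumerate(c, start=1):  # number the vectors 1..N, as in the text
        G[group].add(i)
    return G


c = (3, 1, 1, 1, 2)      # the assignment vector from the example above
G = index_sets(c, k=3)
print(G[1], G[2], G[3])  # the index sets G_1, G_2, G_3
```

A group that receives no vectors simply gets an empty set, which matches the definition of $G_j$.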


Group representatives. With each of the groups we associate a group representative
n-vector, which we denote $z_1, \ldots, z_k$. These representatives can be any
n-vectors; they do not need to be one of the given vectors. We want each representative
to be close to the vectors in its associated group, i.e., we want the
quantities


$$\|x_i - z_{c_i}\|$$


to be small. (Note that $x_i$ is in group $j = c_i$, so $z_{c_i}$ is the representative vector
associated with data vector $x_i$.)


A clustering objective. We can now give a single number that we use to judge a 
choice of clustering, along with a choice of the group representatives. We define 


$$J^{\mathrm{clust}} = \left( \|x_1 - z_{c_1}\|^2 + \cdots + \|x_N - z_{c_N}\|^2 \right) / N, \tag{4.1}$$


which is the mean square distance from the vectors to their associated representatives.
Note that $J^{\mathrm{clust}}$ depends on the cluster assignments (i.e., c), as well as
the choice of the group representatives $z_1, \ldots, z_k$. The smaller $J^{\mathrm{clust}}$ is, the better
the clustering. An extreme case is $J^{\mathrm{clust}} = 0$, which means that the distance between
every original vector and its assigned representative is zero. This happens
only when the original collection of vectors only takes k different values, and each 
vector is assigned to the representative it is equal to. (This extreme case would 
probably not occur in practice.) 
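
As an illustration of (4.1), here is a minimal Python sketch that evaluates the objective for a given assignment and set of representatives. The function names and the three 2-vectors are made up for this example; group labels are 1-based, as in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def j_clust(x, c, z):
    """Mean square distance from each x_i to its representative z_{c_i}."""
    return sum(sq_dist(xi, z[ci - 1]) for xi, ci in zip(x, c)) / len(x)


x = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]   # N = 3 made-up data vectors
c = [1, 1, 2]                              # x_1, x_2 in group 1; x_3 in group 2
z = [(0.0, 1.0), (4.0, 0.0)]               # k = 2 representatives
print(j_clust(x, c, z))                    # equals (1 + 1 + 0) / 3
```

Moving the third vector to group 1 instead would raise the objective, since its squared distance to the first representative is 17 rather than 0.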

Our choice of clustering objective $J^{\mathrm{clust}}$ makes sense, since it encourages all
points to be near their associated representative, but there are other reasonable
choices. For example, it is possible to use an objective that encourages more
balanced groupings. But we will stick with this basic (and very common) choice of 
clustering objective. 


Optimal and suboptimal clustering. We seek a clustering, i.e., a choice of group 
assignments $c_1, \ldots, c_N$ and a choice of representatives $z_1, \ldots, z_k$, that minimize the
objective $J^{\mathrm{clust}}$. We call such a clustering optimal. Unfortunately, for all but the
very smallest problems, it is practically impossible to find an optimal clustering. (It 
can be done in principle, but the amount of computation needed grows extremely 
rapidly with N.) The good news is that the k-means algorithm described in the 
next section requires far less computation (and indeed, can be run for problems 
with N measured in billions), and often finds a very good, if not the absolute best, 
clustering. (Here, ‘very good’ means a clustering and choice of representatives 
that achieves a value of $J^{\mathrm{clust}}$ near its smallest possible value.) We say that the
clustering choices found by the k-means algorithm are suboptimal, which means
that they might not give the lowest possible value of $J^{\mathrm{clust}}$.

Even though it is a hard problem to choose the best clustering and the best 
representatives, it turns out that we can find the best clustering, if the representa- 
tives are fixed, and we can find the best representatives, if the clustering is fixed. 
We address these two topics now. 


Partitioning the vectors with the representatives fixed. Suppose that the group 
representatives $z_1, \ldots, z_k$ are fixed, and we seek the group assignments $c_1, \ldots, c_N$
that achieve the smallest possible value of $J^{\mathrm{clust}}$. It turns out that this problem
can be solved exactly. 

The objective $J^{\mathrm{clust}}$ is a sum of N terms. The choice of $c_i$ (i.e., the group
to which we assign the vector $x_i$) only affects the ith term in $J^{\mathrm{clust}}$, which is
$(1/N)\|x_i - z_{c_i}\|^2$. We can choose $c_i$ to minimize just this term, since $c_i$ does not
affect the other N − 1 terms in $J^{\mathrm{clust}}$. How do we choose $c_i$ to minimize this term?
This is easy: We simply choose $c_i$ to be the value of j that minimizes $\|x_i - z_j\|$
over j. In other words, we should assign each data vector $x_i$ to its nearest neighbor
among the representatives. This choice of assignment is very natural, and easily 
carried out. 

So when the group representatives are fixed, we can readily find the best group 
assignment (i.e., the one that minimizes $J^{\mathrm{clust}}$), by assigning each vector to its
nearest representative. With this choice of group assignment, we have (by the way 
the assignment is made) 


$$\|x_i - z_{c_i}\| = \min_{j=1,\ldots,k} \|x_i - z_j\|,$$


so the value of $J^{\mathrm{clust}}$ is given by


$$\left( \min_{j=1,\ldots,k} \|x_1 - z_j\|^2 + \cdots + \min_{j=1,\ldots,k} \|x_N - z_j\|^2 \right) / N.$$


This has a simple interpretation: It is the mean of the squared distance from the 
data vectors to their closest representative. 
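
The assignment rule just described can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names and made-up data. It minimizes the squared distance, which gives the same minimizer as the distance itself, and it breaks ties in favor of the smallest group number (an assumption about tie-breaking made for this example).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_to_nearest(x, z):
    """Return 1-based labels c, where c_i is the index j of the representative
    z_j closest to x_i. list.index returns the first minimum, so ties go to
    the smallest j."""
    c = []
    for xi in x:
        d = [sq_dist(xi, zj) for zj in z]
        c.append(d.index(min(d)) + 1)
    return c


x = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (2.5, 2.5)]   # made-up data vectors
z = [(0.0, 0.0), (5.0, 5.0)]                           # fixed representatives
c = assign_to_nearest(x, z)
print(c)   # the last vector is equidistant from both representatives
```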


Optimizing the group representatives with the assignment fixed. Now we turn
to the problem of choosing the group representatives, with the clustering (group
assignments) fixed, in order to minimize our objective $J^{\mathrm{clust}}$. It turns out that this
problem also has a simple and natural solution. 

We start by re-arranging the sum of N terms into k sums, each associated with 
one group. We write 


$$J^{\mathrm{clust}} = J_1 + \cdots + J_k,$$


where


$$J_j = (1/N) \sum_{i \in G_j} \|x_i - z_j\|^2$$


is the contribution to the objective $J^{\mathrm{clust}}$ from the vectors in group j. (The sum
here means that we should add up all terms of the form $\|x_i - z_j\|^2$, for any $i \in G_j$,
i.e., for any vector $x_i$ in group j; see appendix A.)

The choice of group representative $z_j$ only affects the term $J_j$; it has no effect
on the other terms in $J^{\mathrm{clust}}$. So we can choose each $z_j$ to minimize $J_j$. Thus we
should choose the vector $z_j$ so as to minimize the mean square distance to the
vectors in group j. This problem has a very simple solution: We should choose $z_j$
to be the average (or mean or centroid) of the vectors $x_i$ in its group:


$$z_j = (1/|G_j|) \sum_{i \in G_j} x_i,$$


where $|G_j|$ is standard mathematical notation for the number of elements in the
set $G_j$, i.e., the size of group j. (See exercise 4.1.)

So if we fix the group assignments, we minimize $J^{\mathrm{clust}}$ by choosing each group
representative to be the average or centroid of the vectors assigned to its group. 
(This is sometimes called the group centroid or cluster centroid.) 
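
The centroid update can be sketched in Python as below. The function names and the data are hypothetical, chosen only for illustration; group labels are 1-based, and groups with no vectors are left out of the result because their centroid is undefined.

```python
def centroid(vectors):
    """Entrywise average of a nonempty list of equal-length vectors."""
    m = len(vectors)
    return tuple(sum(col) / m for col in zip(*vectors))


def update_representatives(x, c, k):
    """Return {j: z_j} for every nonempty group j, with z_j the group centroid."""
    groups = {j: [] for j in range(1, k + 1)}
    for xi, ci in zip(x, c):
        groups[ci].append(xi)
    return {j: centroid(g) for j, g in groups.items() if g}


x = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]   # made-up data vectors
c = [1, 1, 2]                              # fixed group assignments
z = update_representatives(x, c, k=2)
print(z)
```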


4.3 The k-means algorithm


It might seem that we can now solve the problem of choosing the group assignments 
and the group representatives to minimize $J^{\mathrm{clust}}$, since we know how to do this when
one or the other choice is fixed. But the two choices are circular, i.e., each depends 
on the other. Instead we rely on a very old idea in computation: We simply 
iterate between the two choices. This means that we repeatedly alternate between 
updating the group assignments, and then updating the representatives, using the 
methods developed above. In each step the objective $J^{\mathrm{clust}}$ gets better (i.e., goes
down) unless the step does not change the choice. Iterating between choosing 
the group representatives and choosing the group assignments is the celebrated 
k-means algorithm for clustering a collection of vectors. 

The k-means algorithm was first proposed in 1957 by Stuart Lloyd, and inde- 
pendently by Hugo Steinhaus. It is sometimes called the Lloyd algorithm. The 
name ‘k-means’ has been used since the 1960s. 


Figure 4.2 One iteration of the k-means algorithm. The 30 2-vectors $x_i$
are shown as filled circles, and the 3 group representatives $z_j$ are shown as
rectangles. In the left-hand figure the vectors are each assigned to the closest 
representative. In the right-hand figure, the representatives are replaced by 
the cluster centroids. 


Algorithm 4.1 k-MEANS ALGORITHM 


given a list of N vectors $x_1, \ldots, x_N$, and an initial list of k group representative
vectors $z_1, \ldots, z_k$


repeat until convergence


1. Partition the vectors into k groups. For each vector i = 1,..., N, assign $x_i$ to
the group associated with the nearest representative.

2. Update representatives. For each group j = 1,...,k, set $z_j$ to be the mean of
the vectors in group j.


One iteration of the k-means algorithm is illustrated in figure 4.2. 
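
The two steps of algorithm 4.1 can be combined into a short program. The following is a minimal pure-Python sketch under stated assumptions, not a reference implementation: the function names, the `max_iter` safeguard, the 0-based labels, and the six data points are all choices made for this illustration. Ties go to the smallest group index, groups that receive no vectors are dropped, and the loop stops when the assignments repeat.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(vectors):
    """Entrywise average of a nonempty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def kmeans(x, z, max_iter=100):
    """Sketch of algorithm 4.1. x: list of N vectors; z: initial representatives.
    Returns (c, z) with 0-based labels c[i] indexing into the returned list z."""
    z = list(z)
    c = None
    for _ in range(max_iter):
        # Step 1: assign each vector to the nearest representative
        # (min returns the first minimizer, so ties go to the smallest j).
        new_c = [min(range(len(z)), key=lambda j: sq_dist(xi, z[j])) for xi in x]
        if new_c == c:
            break  # same assignments as last time, so nothing can change again
        # Drop groups that received no vectors, and renumber the rest.
        used = sorted(set(new_c))
        renumber = {old: new for new, old in enumerate(used)}
        c = [renumber[ci] for ci in new_c]
        # Step 2: set each representative to the mean of its group.
        z = [mean([xi for xi, ci in zip(x, c) if ci == j]) for j in range(len(used))]
    return c, z


# Made-up data: two well-separated groups of three points each.
x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
c, z = kmeans(x, z=[x[0], x[3]])
print(c)
print(z)
```

For problems of realistic size one would normally store the vectors in arrays and use vectorized distance computations; plain lists are used here only to keep the sketch self-contained.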


Comments and clarifications. 


• Ties in step 1 can be broken by assigning $x_i$ to the group associated with one
of the closest representatives with the smallest value of j. 


• It is possible that in step 1, one or more of the groups can be empty, i.e.,
contain no vectors. In this case we simply drop this group (and its repre- 
sentative). When this occurs, we end up with a partition of the vectors into 
fewer than k groups. 


• If the group assignments found in step 1 are the same in two successive
iterations, the representatives in step 2 will also be the same. It follows that
the group assignments and group representatives will never change in future
iterations, so we should stop the algorithm. This is what we mean by ‘until
convergence’. In practice, one often stops the algorithm earlier, as soon as
the improvement in $J^{\mathrm{clust}}$ in successive iterations becomes very small.


• We start the algorithm with a choice of initial group representatives. One
simple method is to pick the representatives randomly from the original vectors;
another is to start from a random assignment of the original vectors to k
groups, and use the means of the groups as the initial representatives. (There
are more sophisticated methods for choosing the initial representatives, but
this topic is beyond the scope of this book.)


Convergence. The fact that $J^{\mathrm{clust}}$ decreases in each step implies that the k-means
algorithm converges in a finite number of steps. However, depending on the initial 
choice of representatives, the algorithm can, and does, converge to different final 
partitions, with different objective values. 

The k-means algorithm is a heuristic, which means it cannot guarantee that the 
partition it finds minimizes our objective $J^{\mathrm{clust}}$. For this reason it is common to
run the k-means algorithm several times, with different initial representatives, and
choose the one among them with the smallest final value of $J^{\mathrm{clust}}$. Despite the fact
that the k-means algorithm is a heuristic, it is very useful in practical applications, 
and very widely used. 
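
The practice of running k-means several times and keeping the best result can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the names `kmeans_once` and `best_of`, the seed, the iteration cap, and the data are assumptions. Initial representatives are drawn at random from the data, and, as a simplification that differs from the text, an empty group keeps its old representative instead of being dropped.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(vectors):
    """Entrywise average of a nonempty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def kmeans_once(x, k, rng, max_iter=100):
    """One k-means run from k representatives drawn at random from the data.
    Returns (J, c, z): final objective, 0-based labels, representatives."""
    z = rng.sample(x, k)
    c = None
    for _ in range(max_iter):
        new_c = [min(range(k), key=lambda j: sq_dist(xi, z[j])) for xi in x]
        if new_c == c:
            break
        c = new_c
        for j in range(k):
            group = [xi for xi, ci in zip(x, c) if ci == j]
            if group:  # simplification: an empty group keeps its old representative
                z[j] = mean(group)
    J = sum(sq_dist(xi, z[ci]) for xi, ci in zip(x, c)) / len(x)
    return J, c, z


def best_of(x, k, runs=10, seed=0):
    """Run k-means `runs` times and return the run with the smallest objective."""
    rng = random.Random(seed)
    return min((kmeans_once(x, k, rng) for _ in range(runs)), key=lambda r: r[0])


# Made-up data: two well-separated groups of three points each.
x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
J_best, c_best, z_best = best_of(x, k=2)
print(J_best, c_best)
```

On data this cleanly separated every run ends at the same partition; the benefit of restarts shows up on less clean data, where different starting points can give different final objective values.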

Figure 4.3 shows a few iterations generated by the k-means algorithm, applied 
to the example of figure 4.1. We take k = 3 and start with randomly chosen group 
representatives. The final clustering is shown in figure 4.4. Figure 4.5 shows how 
the clustering objective decreases in each step. 


Interpreting the representatives. The representatives $z_1, \ldots, z_k$ associated with
a clustering are quite interpretable. Suppose, for example, that voters in some
election can be well clustered into 7 groups, on the basis of a data set that includes
demographic data and questionnaire or poll data. If the 4th component of our
vectors is the age of the voter, then $(z_3)_4 = 37.8$ tells us that the average age
of voters in group 3 is 37.8. Insight gained from this data can be used to tune 
campaign messages, or choose media outlets for campaign advertising. 

Another way to interpret the group representatives is to find one or a few of the 
original data points that are closest to each representative. These can be thought of
as archetypes for the group. 


Choosing k. It is common to run the k-means algorithm for different values of k, 
and compare the results. How to choose a value of k among these depends on how 
the clustering will be used, which we discuss a bit more in §4.5. But some general
statements can be made. For example, if the value of $J^{\mathrm{clust}}$ with k = 7 is quite a
bit smaller than the values of $J^{\mathrm{clust}}$ for k = 2,...,6, and not much larger than the
values of $J^{\mathrm{clust}}$ for k = 8,9,..., we could reasonably choose k = 7, and conclude
that our data (list of vectors) partitions nicely into 7 groups. 


Figure 4.3 Three iterations of the k-means algorithm. The group representatives
are shown as squares. In each row, the left-hand plot shows the result
of partitioning the vectors in the 3 groups (step 1 of algorithm 4.1). The
right-hand plot shows the updated representatives (step 2 of the algorithm).

Figure 4.4 Final clustering.

Figure 4.5 The clustering objective after step 1 of each iteration.


Complexity. In step 1 of the k-means algorithm, we find the nearest neighbor
to each of N n-vectors, over the list of k centroids. This requires approximately
3Nkn flops. In step 2 we average the n-vectors over each of the cluster groups.
For a cluster with p vectors, this requires n(p − 1) flops, which we approximate as
np flops; averaging all clusters requires a total of Nn flops. This is less than the 
cost of partitioning in step 1. So k-means requires around (3k + 1)Nn flops per 
iteration. Its order is Nkn flops. 

Each run of k-means typically takes fewer than a few tens of iterations, and 
usually k-means is run some modest number of times, like 10. So a very rough 
guess of the number of flops required to run k-means 10 times (in order to choose 
the best partition found) is 1000Nkn flops. 

As an example, suppose we use k-means to partition N = 100000 vectors with 
size n = 100 into k = 10 groups. On a 1 Gflop/s computer we guess that this will 
take around 100 seconds. Given the approximations made here (for example, the 
number of iterations that each run of k-means will take), this is obviously a crude 
estimate. 
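
The arithmetic behind this estimate can be written out explicitly. The short Python sketch below uses the problem sizes from the example; the variable names are arbitrary, and the result is only as good as the rough flop counts it is built from.

```python
N, n, k = 100_000, 100, 10                  # problem size from the example above

flops_per_iteration = (3 * k + 1) * N * n   # step 1 plus step 2 of one iteration
flops_ten_runs = 1000 * N * k * n           # the rough rule of thumb for 10 runs
seconds = flops_ten_runs / 1e9              # on a 1 Gflop/s computer

print(flops_per_iteration, flops_ten_runs, seconds)
```

So one iteration costs about 3.1 × 10^8 flops, ten full runs cost about 10^11 flops, and at 10^9 flops per second this takes about 100 seconds.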


4.4 Examples


Image clustering 


The MNIST (Modified National Institute of Standards and Technology) database of handwritten
digits is a data set containing N = 60000 grayscale images of size 28 × 28, which
we represent as n-vectors with n = 28 × 28 = 784. Figure 4.6 shows a few
examples from the data set. (The data set is available from Yann LeCun at 
yann.lecun.com/exdb/mnist.) 

We use the k-means algorithm to partition these images into k = 20 clusters, 
starting with a random assignment of the vectors to groups, and repeating the 
experiment 20 times. Figure 4.7 shows the clustering objective versus iteration 
number for three of the 20 initial assignments, including the two that gave the 
lowest and the highest final values of the objective. 

Figure 4.8 shows the representatives with the lowest final value of the clustering 
objective. Figure 4.9 shows the set with the highest value. We can see that most 
of the representatives are recognizable digits, with some reasonable confusion, for 
example between ‘4’ and ‘9’ or ‘3’ and ‘8’. This is impressive when you consider 
that the k-means algorithm knows nothing about digits, handwriting, or even that 
the 784-vectors represent 28 x 28 images; it uses only the distances between 784- 
vectors. One interpretation is that the k-means algorithm has ‘discovered’ the 
digits in the data set. 


Figure 4.6 25 images of handwritten digits from the MNIST data set. Each
image has size 28 × 28, and can be represented by a 784-vector.

Figure 4.7 Clustering objective $J^{\mathrm{clust}}$ after each iteration of the k-means
algorithm, for three initial partitions, on digits of the MNIST set.

Figure 4.8 Group representatives found by the k-means algorithm applied
to the MNIST set.

Figure 4.9 Group representatives found by the k-means algorithm applied
to the MNIST set, with a different starting point than in figure 4.8.

Figure 4.10 Clustering objective $J^{\mathrm{clust}}$ after each iteration of the k-means
algorithm, for three initial partitions, on Wikipedia word count histograms.


Document topic discovery 


We start with a corpus of N = 500 Wikipedia articles, compiled from weekly lists 
of the most popular articles between September 6, 2015, and June 11, 2016. We 
remove the section titles and reference sections (bibliography, notes, references, 
further reading), and convert each document to a list of words. The conversion 
removes numbers and stop words, and applies a stemming algorithm to nouns and 
verbs. We then form a dictionary of all the words that appear in at least 20 
documents. This results in a dictionary of 4423 words. Each document in the 
corpus is represented by a word histogram vector of length 4423. 

We apply the k-means algorithm with k = 9, and 20 randomly chosen initial 
partitions. The k-means algorithm converges to similar but slightly different clus- 
terings of the documents in each case. Figure 4.10 shows the clustering objective 
versus iteration of the k-means algorithm for three of these, including the one that 
gave the lowest final value of $J^{\mathrm{clust}}$, which we use below.

Table 4.1 summarizes the clustering with the lowest value of $J^{\mathrm{clust}}$. For each of
the nine clusters we show the largest ten coefficients of the word histogram of the 
cluster representative. Table 4.2 gives the size of each cluster and the titles of the 
ten articles closest to the cluster representative. 

Each of the clusters makes sense, and mostly contains documents on similar 
topics, or similar themes. The words with largest coefficients in the group rep- 
resentatives also make sense. It is interesting to see that k-means clustering has 
assigned movies and TV series (mostly) to different clusters (9 and 6). One can 
also note that clusters 8 and 9 share several top key words but are on separate 
topics (actors and movies, respectively). 




Cluster 1                    Cluster 2                    Cluster 3
Word           Coefficient   Word           Coefficient   Word           Coefficient
fight          0.038         holiday        0.012         united         0.004
win            0.022         celebrate      0.009         family         0.003
event          0.019         festival       0.007         party          0.003
champion       0.015         celebration    0.007         president      0.003
fighter        0.015         calendar       0.006         government     0.003
bout           0.015         church         0.006         school         0.003
title          0.013         united         0.005         American       0.003
Ali            0.012         date           0.005         university     0.003
championship   0.011         moon           0.005         city           0.003
bonus          0.010         event          0.005         attack         0.003

Cluster 4                    Cluster 5                    Cluster 6
Word           Coefficient   Word           Coefficient   Word           Coefficient
album          0.031         game           0.023         series         0.029
release        0.016         season         0.020         season         0.027
song           0.015         team           0.018         episode        0.013
music          0.014         win            0.017         character      0.011
single         0.011         player         0.014         film           0.008
record         0.010         play           0.013         television     0.007
band           0.009         league         0.010         cast           0.006
perform        0.007         final          0.010         announce       0.006
tour           0.007         score          0.008         release        0.005
chart          0.007         record         0.007         appear         0.005

Cluster 7                    Cluster 8                    Cluster 9
Word           Coefficient   Word           Coefficient   Word           Coefficient
match          0.065         film           0.036         film           0.061
win            0.018         star           0.014         million        0.019
championship   0.016         role           0.014         release        0.013
team           0.015         play           0.010         star           0.010
event          0.015         series         0.009         character      0.006
style          0.014         appear         0.008         role           0.006
raw            0.013         award          0.008         movie          0.005
title          0.011         actor          0.007         weekend        0.005
episode        0.010         character      0.006         story          0.005
perform        0.010         release        0.006         gross          0.005


Table 4.1 The 9 cluster representatives. For each representative we show 
the largest 10 coefficients of the word histogram. 


Cluster  Size  Titles

1        21    Floyd Mayweather, Jr; Kimbo Slice; Ronda Rousey; José
               Aldo; Joe Frazier; Wladimir Klitschko; Saul Alvarez;
               Gennady Golovkin; Nate Diaz; Conor McGregor.

2        43    Halloween; Guy Fawkes Night; Diwali; Hanukkah;
               Groundhog Day; Rosh Hashanah; Yom Kippur;
               Seventh-day Adventist Church; Remembrance Day;
               Mother’s Day.

3        189   Mahatma Gandhi; Sigmund Freud; Carly Fiorina;
               Frederick Douglass; Marco Rubio; Christopher Columbus;
               Fidel Castro; Jim Webb; Genie (feral child); Pablo
               Escobar.

4        46    David Bowie; Kanye West; Celine Dion; Kesha; Ariana
               Grande; Adele; Gwen Stefani; Anti (album); Dolly Parton;
               Sia Furler.

5        49    Kobe Bryant; Lamar Odom; Johan Cruyff; Yogi Berra;
               José Mourinho; Halo 5: Guardians; Tom Brady; Eli
               Manning; Stephen Curry; Carolina Panthers.

6        39    The X-Files; Game of Thrones; House of Cards (U.S. TV
               series); Daredevil (TV series); Supergirl (U.S. TV series);
               American Horror Story; The Flash (2014 TV series); The
               Man in the High Castle (TV series); Sherlock (TV series);
               Scream Queens (2015 TV series).

7        16    Wrestlemania 32; Payback (2016); Survivor Series (2015);
               Royal Rumble (2016); Night of Champions (2015);
               Fastlane (2016); Extreme Rules (2016); Hell in a Cell
               (2015); TLC: Tables, Ladders & Chairs (2015); Shane
               McMahon.

8        58    Ben Affleck; Johnny Depp; Maureen O’Hara; Kate
               Beckinsale; Leonardo DiCaprio; Keanu Reeves; Charlie
               Sheen; Kate Winslet; Carrie Fisher; Alan Rickman.

9        39    Star Wars: The Force Awakens; Star Wars Episode I: The
               Phantom Menace; The Martian (film); The Revenant
               (2015 film); The Hateful Eight; Spectre (2015 film); The
               Jungle Book (2016 film); Bajirao Mastani (film); Back to
               the Future II; Back to the Future.

Table 4.2 Cluster sizes and titles of 10 documents closest to the cluster 
representatives. 


The identification of these separate topics among the documents is impressive,
when you consider that the k-means algorithm does not understand the meaning of 
the words in the documents (and indeed, does not even know the order of the words 
in each document). It uses only the simple concept of document dissimilarity, as 
measured by the distance between word count histogram vectors. 


4.5 Applications


Clustering, and the k-means algorithm in particular, has many uses and applica- 
tions. It can be used for exploratory data analysis, to get an idea of what a large 
collection of vectors ‘looks like’. When k is small enough, say less than a few tens, 
it is common to examine the group representatives, and some of the vectors in the 
associated groups, to interpret or label the groups. Clustering can also be used for 
more specific directed tasks, a few of which we describe below. 


Classification. We cluster a large collection of vectors into k groups, and label 
the groups by hand. We can now assign (or classify) new vectors to one of the 
k groups by choosing the nearest group representative. In our example of the 
handwritten digits above, this would give us a rudimentary digit classifier, which 
would automatically guess what a written digit is from its image. In the topic 
discovery example, we can automatically classify new documents into one of the k 
topics. (We will see better classification methods in chapter 14.) 


Recommendation engine. Clustering can be used to build a recommendation en- 
gine, which suggests items that a user or customer might be interested in. Suppose 
the vectors give the number of times a user has listened to or streamed each song 
from a library of n songs over some period. These vectors are typically sparse, since 
each user has listened to only a very small fraction of the music library. Clustering 
the vectors reveals groups of users with similar musical taste. The group representatives
have a nice interpretation: $(z_j)_i$ is the average number of times users in
group j listened to song i.

This interpretation allows us to create a set of recommendations for each user. 
We first identify which cluster j her music listening vector $x_i$ is in. Then we can
suggest to her songs that she has not listened to, but others in her group (i.e., those
with similar musical taste) have listened to most often. To recommend 5 songs to
her, we find the indices $\ell$ with $(x_i)_\ell = 0$, with the 5 largest values of $(z_j)_\ell$.
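
This selection rule is easy to express in code. The sketch below is a hypothetical Python illustration with made-up listening counts; `recommend` is not a function from any library, and song indices are 0-based here.

```python
def recommend(x_user, z_group, count=5):
    """Return up to `count` song indices that the user has never played,
    ordered by decreasing average play count in the user's group."""
    unheard = [l for l, plays in enumerate(x_user) if plays == 0]
    unheard.sort(key=lambda l: z_group[l], reverse=True)
    return unheard[:count]


x_user = [3, 0, 0, 7, 0, 0]                  # made-up play counts for 6 songs
z_group = [2.0, 0.5, 4.1, 6.0, 0.0, 1.2]     # made-up group representative
picks = recommend(x_user, z_group, count=2)
print(picks)
```

Songs the user has already played are never suggested, even when they have the largest group averages.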


Guessing missing entries. Suppose we have a collection of vectors, with some 
entries of some of the vectors missing or not given. (The symbol ‘?’ or ‘*’ is 
sometimes used to denote a missing entry in a vector.) For example, suppose 
the vectors collect attributes of a collection of people, such as age, sex, years of 
education, income, number of children, and so on. A vector containing the symbol 
‘?’ in the age entry means that we do not know that particular person’s age.
Guessing missing entries of vectors in a collection of vectors is sometimes called
imputing the missing entries. In our example, we might want to guess the age of
the person whose age we do not know. 

We can use clustering, and the k-means algorithm in particular, to guess the 
missing entries. We first carry out k-means clustering on our data, using only those 
vectors that are complete, i.e., all of their entries are known. Now consider a vector 
x in our collection that is missing one or more entries. Since some of the entries of 
x are unknown, we cannot find the distances $\|x - z_j\|$, and therefore we cannot say
which group representative is closest to x. Instead we will find the closest group 
representative to x using only the known entries in x, by finding j that minimizes 


$$\sum_{i \in \mathcal{K}} (x_i - (z_j)_i)^2,$$


where $\mathcal{K}$ is the set of indices for the known entries of the vector x. This gives us
the closest representative to x, calculated using only its known entries. To guess
the missing entries in x, we simply use the corresponding entries in $z_j$, the nearest
representative. 

Returning to our example, we would guess the age of the person with a missing 
age entry by finding the closest representative (ignoring age); then we use the age 
entry of the representative, which is simply the average age of all the people in that 
cluster. 
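
The imputation procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: missing entries are encoded as `None`, the function name `impute` is hypothetical, and the two representatives (age, years of education, income in thousands) are made-up numbers.

```python
def impute(x, z):
    """Fill the missing (None) entries of x from the representative that is
    nearest to x when distance is measured over the known entries only."""
    known = [i for i, value in enumerate(x) if value is not None]

    def partial_sq_dist(zj):
        return sum((x[i] - zj[i]) ** 2 for i in known)

    nearest = min(z, key=partial_sq_dist)
    return [nearest[i] if value is None else value for i, value in enumerate(x)]


z = [(30.0, 12.0, 40.0), (55.0, 16.0, 90.0)]   # made-up group representatives
x = [None, 15.0, 80.0]                         # age is unknown
print(impute(x, z))
```

In practice the entries would usually be standardized first, since otherwise an entry with a large numerical range (income here) dominates the partial distance.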


Exercises


4.1 Minimizing mean square distance to a set of vectors. Let $x_1, \ldots, x_L$ be a collection of
n-vectors. In this exercise you will fill in the missing parts of the argument to show that
the vector z which minimizes the sum-square distance to the vectors,


$$J(z) = \|x_1 - z\|^2 + \cdots + \|x_L - z\|^2,$$


is the average or centroid of the vectors, $\bar{x} = (1/L)(x_1 + \cdots + x_L)$. (This result is used
in one of the steps in the k-means algorithm. But here we have simplified the notation.)


(a) Explain why, for any z, we have


$$J(z) = \sum_{i=1}^{L} \|x_i - \bar{x} - (z - \bar{x})\|^2 = \sum_{i=1}^{L} \left( \|x_i - \bar{x}\|^2 - 2 (x_i - \bar{x})^T (z - \bar{x}) \right) + L \|z - \bar{x}\|^2.$$


(b) Explain why $\sum_{i=1}^{L} (x_i - \bar{x})^T (z - \bar{x}) = 0$. Hint. Write the left-hand side as


$$\left( \sum_{i=1}^{L} (x_i - \bar{x}) \right)^T (z - \bar{x}),$$


and argue that the left-hand vector is 0.


(c) Combine the results of (a) and (b) to get $J(z) = \sum_{i=1}^{L} \|x_i - \bar{x}\|^2 + L \|z - \bar{x}\|^2$.
Explain why for any $z \neq \bar{x}$, we have $J(z) > J(\bar{x})$. This shows that the choice $z = \bar{x}$
minimizes J(z).


4.2 k-means with nonnegative, proportions, or Boolean vectors. Suppose that the vectors
$x_1, \ldots, x_N$ are clustered using k-means, with group representatives $z_1, \ldots, z_k$.


(a) Suppose the original vectors $x_i$ are nonnegative, i.e., their entries are nonnegative.
Explain why the representatives $z_j$ are also nonnegative.

(b) Suppose the original vectors $x_i$ represent proportions, i.e., their entries are nonnegative
and sum to one. (This is the case when $x_i$ are word count histograms, for
example.) Explain why the representatives $z_j$ also represent proportions, i.e., their
entries are nonnegative and sum to one.

(c) Suppose the original vectors $x_i$ are Boolean, i.e., their entries are either 0 or 1. Give
an interpretation of $(z_j)_i$, the ith entry of the jth group representative.


Hint. Each representative is the average of some of the original vectors.


4.3 Linear separation in 2-way partitioning. Clustering a collection of vectors into k = 2
groups is called 2-way partitioning, since we are partitioning the vectors into 2 groups,
with index sets $G_1$ and $G_2$. Suppose we run k-means, with k = 2, on the n-vectors
$x_1, \ldots, x_N$. Show that there is a nonzero vector w and a scalar v that satisfy

$$w^T x_i + v \geq 0 \ \text{ for } i \in G_1, \qquad w^T x_i + v \leq 0 \ \text{ for } i \in G_2.$$

In other words, the affine function $f(x) = w^T x + v$ is greater than or equal to zero on
the first group, and less than or equal to zero on the second group. This is called linear
separation of the two groups (although affine separation would be more accurate).

Hint. Consider the function $\|x - z_1\|^2 - \|x - z_2\|^2$, where $z_1$ and $z_2$ are the group
representatives.
4.4 Pre-assigned vectors. Suppose that some of the vectors $x_1, \ldots, x_N$ are assigned to specific
groups. For example, we might insist that $x_{27}$ be assigned to group 5. Suggest a simple
modification of the k-means algorithm that respects this requirement. Describe a practical
example where this might arise, when each vector represents n features of a medical
patient.
Chapter 5


Linear independence 


In this chapter we explore the concept of linear independence, which will play an 
important role in the sequel. 


Linear dependence 


A collection or list of n-vectors $a_1, \ldots, a_k$ (with $k \geq 1$) is called linearly dependent
if

$$\beta_1 a_1 + \cdots + \beta_k a_k = 0$$

holds for some $\beta_1, \ldots, \beta_k$ that are not all zero. In other words, we can form the
zero vector as a linear combination of the vectors, with coefficients that are not all
zero. Linear dependence of a list of vectors does not depend on the ordering of the
vectors in the list.

When a collection of vectors is linearly dependent, at least one of the vectors
can be expressed as a linear combination of the other vectors: If $\beta_i \neq 0$ in the
equation above (and by definition, this must be true for at least one i), we can
move the term $\beta_i a_i$ to the other side of the equation and divide by $\beta_i$ to get

$$a_i = (-\beta_1/\beta_i) a_1 + \cdots + (-\beta_{i-1}/\beta_i) a_{i-1} + (-\beta_{i+1}/\beta_i) a_{i+1} + \cdots + (-\beta_k/\beta_i) a_k.$$

The converse is also true: If any vector in a collection of vectors is a linear combination
of the other vectors, then the collection of vectors is linearly dependent.

Following standard mathematical language usage, we will say “The vectors
$a_1, \ldots, a_k$ are linearly dependent” to mean “The list of vectors $a_1, \ldots, a_k$ is linearly
dependent”. But it must be remembered that linear dependence is an attribute of
a collection of vectors, and not individual vectors.


Linearly independent vectors. A collection of n-vectors $a_1, \ldots, a_k$ (with $k \geq 1$)
is called linearly independent if it is not linearly dependent, which means that

$$\beta_1 a_1 + \cdots + \beta_k a_k = 0 \qquad (5.1)$$

only holds for $\beta_1 = \cdots = \beta_k = 0$. In other words, the only linear combination of
the vectors that equals the zero vector is the linear combination with all coefficients
zero.

As with linear dependence, we will say “The vectors $a_1, \ldots, a_k$ are linearly
independent” to mean “The list of vectors $a_1, \ldots, a_k$ is linearly independent”. But,
like linear dependence, linear independence is an attribute of a collection of vectors,
and not individual vectors.

It is generally not easy to determine by casual inspection whether or not a list 
of vectors is linearly dependent or linearly independent. But we will soon see an 
algorithm that does this. 


Examples. 


• A list consisting of a single vector is linearly dependent only if the vector is
zero. It is linearly independent only if the vector is nonzero.

• Any list of vectors containing the zero vector is linearly dependent.

• A list of two vectors is linearly dependent if and only if one of the vectors
is a multiple of the other one. More generally, a list of vectors is linearly
dependent if any one of the vectors is a multiple of another one.

• The vectors

$$a_1 = \begin{bmatrix} 0.2 \\ -7.0 \\ 8.6 \end{bmatrix}, \quad a_2 = \begin{bmatrix} -0.1 \\ 2.0 \\ -1.0 \end{bmatrix}, \quad a_3 = \begin{bmatrix} 0.0 \\ -1.0 \\ 2.2 \end{bmatrix}$$

are linearly dependent, since $a_1 + 2a_2 - 3a_3 = 0$. We can express any of
these vectors as a linear combination of the other two. For example, we have
$a_2 = (-1/2) a_1 + (3/2) a_3$.

• The vectors

$$a_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad a_2 = \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}, \quad a_3 = \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}$$

are linearly independent. To see this, suppose $\beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 = 0$. This
means that

$$\beta_1 - \beta_3 = 0, \qquad -\beta_2 + \beta_3 = 0, \qquad \beta_2 + \beta_3 = 0.$$

Adding the last two equations we find that $2\beta_3 = 0$, so $\beta_3 = 0$. Using this,
the first equation is then $\beta_1 = 0$, and the second equation is $\beta_2 = 0$.

• The standard unit n-vectors $e_1, \ldots, e_n$ are linearly independent. To see this,
suppose that (5.1) holds. We have

$$0 = \beta_1 e_1 + \cdots + \beta_n e_n = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix},$$

so we conclude that $\beta_1 = \cdots = \beta_n = 0$.
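
The two numerical examples above can be checked by machine. The following is a minimal sketch in plain Python, not a method from the text: the helper names `lin_comb` and `is_independent` and the tolerance `tol` are illustrative assumptions, and the independence test uses row reduction, a swapped-in technique that this chapter does not describe (the chapter's own test is the Gram-Schmidt algorithm, introduced later).

```python
def lin_comb(coeffs, vectors):
    """Return the linear combination sum_i coeffs[i] * vectors[i]."""
    n = len(vectors[0])
    return [sum(c * v[j] for c, v in zip(coeffs, vectors)) for j in range(n)]


def is_independent(vectors, tol=1e-10):
    """Row-reduce the list; it is independent iff every vector yields a pivot."""
    rows = [[float(x) for x in v] for v in vectors]
    rank = 0
    for col in range(len(rows[0])):
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, len(rows)), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank == len(rows)


dependent = [[0.2, -7.0, 8.6], [-0.1, 2.0, -1.0], [0.0, -1.0, 2.2]]
independent = [[1, 0, 0], [0, -1, 1], [-1, 1, 1]]

residual = lin_comb([1, 2, -3], dependent)  # a_1 + 2 a_2 - 3 a_3
print(residual, is_independent(dependent), is_independent(independent))
```

The residual is zero only up to floating-point round-off, which is why the test compares against a small tolerance rather than exact zero.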
Linear combinations of linearly independent vectors. Suppose a vector x is a
linear combination of $a_1, \ldots, a_k$,

$$x = \beta_1 a_1 + \cdots + \beta_k a_k.$$

When the vectors $a_1, \ldots, a_k$ are linearly independent, the coefficients that form x
are unique: If we also have

$$x = \gamma_1 a_1 + \cdots + \gamma_k a_k,$$

then $\beta_i = \gamma_i$ for $i = 1, \ldots, k$. This tells us that, in principle at least, we can find
the coefficients that form a vector x as a linear combination of linearly independent
vectors.

To see this, we subtract the two equations above to get

$$0 = (\beta_1 - \gamma_1) a_1 + \cdots + (\beta_k - \gamma_k) a_k.$$

Since $a_1, \ldots, a_k$ are linearly independent, we conclude that $\beta_i - \gamma_i$ are all zero.

The converse is also true: If each linear combination of a list of vectors can 
only be expressed as a linear combination with one set of coefficients, then the 
list of vectors is linearly independent. This gives a nice interpretation of linear 
independence: A list of vectors is linearly independent if and only if for any linear 
combination of them, we can infer or deduce the associated coefficients. (We will 
see later how to do this.) 


Supersets and subsets. If a collection of vectors is linearly dependent, then any 
superset of it is linearly dependent. In other words: If we add vectors to a linearly 
dependent collection of vectors, the new collection is also linearly dependent. Any 
nonempty subset of a linearly independent collection of vectors is linearly inde- 
pendent. In other words: Removing vectors from a collection of vectors preserves 
linear independence. 


Basis 


Independence-dimension inequality. If the n-vectors $a_1, \ldots, a_k$ are linearly independent,
then $k \leq n$. In words:

A linearly independent collection of n-vectors can have at most n elements.

Put another way:

Any collection of n + 1 or more n-vectors is linearly dependent.

As a very simple example, we can conclude that any three 2-vectors must be linearly
dependent. This is illustrated in figure 5.1.

We will prove this fundamental fact below; but first, we describe the concept 
of basis, which relies on the independence-dimension inequality. 
Figure 5.1 Left. Three 2-vectors. Right. The vector $a_3$ is a linear combination
of $a_1$ and $a_2$, which shows that the vectors are linearly dependent.


Basis. A collection of n linearly independent n-vectors (i.e., a collection of linearly
independent vectors of the maximum possible size) is called a basis. If the n-vectors
$a_1, \ldots, a_n$ are a basis, then any n-vector b can be written as a linear combination
of them. To see this, consider the collection of n + 1 n-vectors $a_1, \ldots, a_n, b$. By the
independence-dimension inequality, these vectors are linearly dependent, so there
are $\beta_1, \ldots, \beta_{n+1}$, not all zero, that satisfy

$$\beta_1 a_1 + \cdots + \beta_n a_n + \beta_{n+1} b = 0.$$

If $\beta_{n+1} = 0$, then we have

$$\beta_1 a_1 + \cdots + \beta_n a_n = 0,$$

which, since $a_1, \ldots, a_n$ are linearly independent, implies that $\beta_1 = \cdots = \beta_n = 0$.
But then all the $\beta_i$ are zero, a contradiction. So we conclude that $\beta_{n+1} \neq 0$. It
follows that

$$b = (-\beta_1/\beta_{n+1}) a_1 + \cdots + (-\beta_n/\beta_{n+1}) a_n,$$

i.e., b is a linear combination of $a_1, \ldots, a_n$.

Combining this result with the observation above that any linear combination
of linearly independent vectors can be expressed in only one way, we conclude:

Any n-vector b can be written in a unique way as a linear combination
of a basis $a_1, \ldots, a_n$.


Expansion in a basis. When we express an n-vector b as a linear combination of
a basis $a_1, \ldots, a_n$, we refer to

$$b = \alpha_1 a_1 + \cdots + \alpha_n a_n$$

as the expansion of b in the $a_1, \ldots, a_n$ basis. The numbers $\alpha_1, \ldots, \alpha_n$ are called
the coefficients of the expansion of b in the basis $a_1, \ldots, a_n$. (We will see later how
to find the coefficients in the expansion of a vector in a basis.)



Examples. 


• The n standard unit n-vectors $e_1, \ldots, e_n$ are a basis. Any n-vector b can be
written as the linear combination

$$b = b_1 e_1 + \cdots + b_n e_n.$$

(This was already observed on page 17.) This expansion is unique, which
means that there is no other linear combination of $e_1, \ldots, e_n$ that equals b.

• The vectors

$$a_1 = \begin{bmatrix} 1.2 \\ -2.6 \end{bmatrix}, \quad a_2 = \begin{bmatrix} -0.3 \\ -3.7 \end{bmatrix}$$

are a basis. The vector b = (1, 1) can be expressed in only one way as a linear
combination of them:

$$b = 0.6513\, a_1 - 0.7280\, a_2.$$

(The coefficients are given here to 4 significant digits. We will see later how
these coefficients can be computed.)
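
As a quick check of the second example, the coefficients of b = (1, 1) in the basis $a_1 = (1.2, -2.6)$, $a_2 = (-0.3, -3.7)$ can be computed directly. The sketch below is only an illustration for the 2-vector case: it uses Cramer's rule, which is not the general method the text develops later, and the function name `expand_in_basis_2d` is an assumption made for this example.

```python
def expand_in_basis_2d(a1, a2, b):
    """Return (alpha1, alpha2) with b = alpha1*a1 + alpha2*a2 (Cramer's rule)."""
    det = a1[0] * a2[1] - a2[0] * a1[1]
    if det == 0:
        raise ValueError("a1 and a2 are linearly dependent, so not a basis")
    alpha1 = (b[0] * a2[1] - a2[0] * b[1]) / det
    alpha2 = (a1[0] * b[1] - b[0] * a1[1]) / det
    return alpha1, alpha2


a1, a2, b = (1.2, -2.6), (-0.3, -3.7), (1.0, 1.0)
alpha1, alpha2 = expand_in_basis_2d(a1, a2, b)

# Rebuild b from the coefficients to confirm the expansion.
rebuilt = [alpha1 * a1[j] + alpha2 * a2[j] for j in range(2)]
print(alpha1, alpha2, rebuilt)
```

Rounded to four significant digits, the two coefficients agree with the values quoted in the example.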


Cash flows and single period loans. As a practical example, we consider cash
flows over n periods, with positive entries meaning income or cash in and negative
entries meaning payments or cash out. We define the single-period loan cash flow
vectors as

$$l_i = \begin{bmatrix} 0_{i-1} \\ 1 \\ -(1+r) \\ 0_{n-i-1} \end{bmatrix}, \quad i = 1, \ldots, n-1,$$

where $r \geq 0$ is the per-period interest rate. The cash flow $l_i$ represents a loan of \$1
in period i, which is paid back in period i + 1 with interest r. (The subscripts on
the zero vectors above give their dimensions.) Scaling $l_i$ changes the loan amount;
scaling $l_i$ by a negative coefficient converts it into a loan to another entity (which
is paid back in period i + 1 with interest).

The vectors $e_1, l_1, \ldots, l_{n-1}$ are a basis. (The first vector $e_1$ represents income of
\$1 in period 1.) To see this, we show that they are linearly independent. Suppose
that

$$\beta_1 e_1 + \beta_2 l_1 + \cdots + \beta_n l_{n-1} = 0.$$

We can express this as

$$\begin{bmatrix} \beta_1 + \beta_2 \\ \beta_3 - (1+r)\beta_2 \\ \vdots \\ \beta_n - (1+r)\beta_{n-1} \\ -(1+r)\beta_n \end{bmatrix} = 0.$$

The last entry is $-(1+r)\beta_n = 0$, which implies that $\beta_n = 0$ (since $1 + r > 0$).
Using $\beta_n = 0$, the second to last entry becomes $-(1+r)\beta_{n-1} = 0$, so we conclude
that $\beta_{n-1} = 0$. Continuing this way we find that $\beta_{n-2}, \ldots, \beta_2$ are all zero. The
first entry of the equation above, $\beta_1 + \beta_2 = 0$, then implies $\beta_1 = 0$. We conclude
that the vectors $e_1, l_1, \ldots, l_{n-1}$ are linearly independent, and therefore a basis.

This means that any cash flow n-vector c can be expressed as a linear combination
of (i.e., replicated by) an initial payment and one period loans:

$$c = \alpha_1 e_1 + \alpha_2 l_1 + \cdots + \alpha_n l_{n-1}.$$

It is possible to work out what the coefficients are (see exercise 5.3). The most
interesting one is the first coefficient

$$\alpha_1 = c_1 + \frac{c_2}{1+r} + \cdots + \frac{c_n}{(1+r)^{n-1}},$$

which is exactly the net present value (NPV) of the cash flow, with interest rate r.
Thus we see that any cash flow can be replicated as an income in period 1 equal
to its net present value, plus a linear combination of one-period loans at interest
rate r.
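
The replication claim can be tested numerically. The sketch below is one possible way to do it, under stated assumptions: the function names (`replicate_cash_flow`, `npv`, `combine`) and the sample cash flow are made up for illustration, and the coefficients are found by working backward from the last entry, the same style of argument used above to show linear independence.

```python
def replicate_cash_flow(c, r):
    """Return alpha with c = alpha[0]*e_1 + alpha[1]*l_1 + ... + alpha[n-1]*l_{n-1}."""
    n = len(c)
    if n == 1:
        return [float(c[0])]
    alpha = [0.0] * n
    # Last entry of the vector equation: -(1 + r) * alpha_n = c_n.
    alpha[n - 1] = -c[n - 1] / (1 + r)
    # Middle entries, from the bottom up: alpha_{i+1} - (1 + r) * alpha_i = c_i.
    for i in range(n - 2, 0, -1):
        alpha[i] = (alpha[i + 1] - c[i]) / (1 + r)
    # First entry: alpha_1 + alpha_2 = c_1.
    alpha[0] = c[0] - alpha[1]
    return alpha


def npv(c, r):
    """Net present value c_1 + c_2/(1+r) + ... + c_n/(1+r)^(n-1)."""
    return sum(ci / (1 + r) ** i for i, ci in enumerate(c))


def combine(alpha, r):
    """Rebuild the cash flow alpha[0]*e_1 + sum_j alpha[j]*l_j."""
    n = len(alpha)
    c = [0.0] * n
    c[0] += alpha[0]
    for j in range(1, n):  # l_j has +1 in period j and -(1+r) in period j+1
        c[j - 1] += alpha[j]
        c[j] -= (1 + r) * alpha[j]
    return c


c = [1.0, -2.0, 3.0, 0.5]  # hypothetical cash flow over n = 4 periods
r = 0.05
alpha = replicate_cash_flow(c, r)
print(alpha[0], npv(c, r), combine(alpha, r))
```

The first coefficient should agree with the NPV up to round-off, and `combine` should return the original cash flow.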


Proof of independence-dimension inequality. The proof is by induction on the
dimension n. First consider a linearly independent collection $a_1, \ldots, a_k$ of 1-vectors.
We must have $a_1 \neq 0$. This means that every element $a_i$ of the collection can be
expressed as a multiple $a_i = (a_i/a_1) a_1$ of the first element $a_1$. This contradicts
linear independence unless k = 1.

Next suppose $n \geq 2$ and the independence-dimension inequality holds for dimension
n - 1. Let $a_1, \ldots, a_k$ be a linearly independent list of n-vectors. We need
to show that $k \leq n$. We partition the vectors as

$$a_i = \begin{bmatrix} b_i \\ \alpha_i \end{bmatrix}, \quad i = 1, \ldots, k,$$

where $b_i$ is an (n - 1)-vector and $\alpha_i$ is a scalar.

First suppose that $\alpha_1 = \cdots = \alpha_k = 0$. Then the vectors $b_1, \ldots, b_k$ are linearly
independent: $\sum_{i=1}^{k} \beta_i b_i = 0$ holds if and only if $\sum_{i=1}^{k} \beta_i a_i = 0$, which is only
possible for $\beta_1 = \cdots = \beta_k = 0$ because the vectors $a_i$ are linearly independent.
The vectors $b_1, \ldots, b_k$ therefore form a linearly independent collection of (n - 1)-vectors.
By the induction hypothesis we have $k \leq n - 1$, so certainly $k \leq n$.

Next suppose that the scalars $\alpha_i$ are not all zero. Assume $\alpha_j \neq 0$. We define a
collection of k - 1 vectors $c_i$ of length n - 1 as follows:

$$c_i = b_i - \frac{\alpha_i}{\alpha_j} b_j, \quad i = 1, \ldots, j-1, \qquad c_i = b_{i+1} - \frac{\alpha_{i+1}}{\alpha_j} b_j, \quad i = j, \ldots, k-1.$$

These k - 1 vectors are linearly independent: If $\sum_{i=1}^{k-1} \beta_i c_i = 0$ then

$$\sum_{i=1}^{j-1} \beta_i \begin{bmatrix} b_i \\ \alpha_i \end{bmatrix} + \gamma \begin{bmatrix} b_j \\ \alpha_j \end{bmatrix} + \sum_{i=j+1}^{k} \beta_{i-1} \begin{bmatrix} b_i \\ \alpha_i \end{bmatrix} = 0 \qquad (5.2)$$

with

$$\gamma = -\frac{1}{\alpha_j} \left( \sum_{i=1}^{j-1} \beta_i \alpha_i + \sum_{i=j+1}^{k} \beta_{i-1} \alpha_i \right).$$
Figure 5.2 Orthonormal vectors in a plane.


Since the vectors $a_i = (b_i, \alpha_i)$ are linearly independent, the equality (5.2) only holds
when the coefficients $\beta_i$ and $\gamma$ are all zero. This in turn implies that the vectors
$c_1, \ldots, c_{k-1}$ are linearly independent. By the induction hypothesis $k - 1 \leq n - 1$,
so we have established that $k \leq n$.


Orthonormal vectors 


A collection of vectors $a_1, \ldots, a_k$ is orthogonal or mutually orthogonal if $a_i \perp a_j$ for
any i, j with $i \neq j$, $i, j = 1, \ldots, k$. A collection of vectors $a_1, \ldots, a_k$ is orthonormal
if it is orthogonal and $\|a_i\| = 1$ for $i = 1, \ldots, k$. (A vector of norm one is called
normalized; dividing a vector by its norm is called normalizing it.) Thus, each
vector in an orthonormal collection of vectors is normalized, and two different
vectors from the collection are orthogonal. These two conditions can be combined
into one statement about the inner products of pairs of vectors in the collection:
$a_1, \ldots, a_k$ is orthonormal means that

$$a_i^T a_j = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

Orthonormality, like linear dependence and independence, is an attribute of a collection
of vectors, and not an attribute of vectors individually. By convention,
though, we say “The vectors $a_1, \ldots, a_k$ are orthonormal” to mean “The collection
of vectors $a_1, \ldots, a_k$ is orthonormal”.
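Examples. The standard unit n-vectors $e_1, \ldots, e_n$ are orthonormal. As another
example, the 3-vectors

$$\begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix}, \quad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \quad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \qquad (5.3)$$

are orthonormal. Figure 5.2 shows a set of two orthonormal 2-vectors.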


Linear independence of orthonormal vectors. Orthonormal vectors are linearly
independent. To see this, suppose $a_1, \ldots, a_k$ are orthonormal, and

$$\beta_1 a_1 + \cdots + \beta_k a_k = 0.$$

Taking the inner product of this equality with $a_i$ yields

$$0 = a_i^T (\beta_1 a_1 + \cdots + \beta_k a_k) = \beta_1 (a_i^T a_1) + \cdots + \beta_k (a_i^T a_k) = \beta_i,$$

since $a_i^T a_j = 0$ for $j \neq i$ and $a_i^T a_i = 1$. Thus, the only linear combination of
$a_1, \ldots, a_k$ that is zero is the one with all coefficients zero.


Linear combinations of orthonormal vectors. Suppose a vector x is a linear combination
of $a_1, \ldots, a_k$, where $a_1, \ldots, a_k$ are orthonormal,

$$x = \beta_1 a_1 + \cdots + \beta_k a_k.$$

Taking the inner product of the left-hand and right-hand sides of this equation
with $a_i$ yields

$$a_i^T x = a_i^T (\beta_1 a_1 + \cdots + \beta_k a_k) = \beta_i,$$

using the same argument as above. So if a vector x is a linear combination of
orthonormal vectors, we can easily find the coefficients of the linear combination
by taking the inner products with the vectors.

For any x that is a linear combination of orthonormal vectors $a_1, \ldots, a_k$, we
have the identity

$$x = (a_1^T x) a_1 + \cdots + (a_k^T x) a_k. \qquad (5.4)$$

This identity gives us a simple way to check if an n-vector y is a linear combination
of the orthonormal vectors $a_1, \ldots, a_k$. If the identity (5.4) holds for y, i.e.,

$$y = (a_1^T y) a_1 + \cdots + (a_k^T y) a_k,$$

then (evidently) y is a linear combination of $a_1, \ldots, a_k$; conversely, if y is a linear
combination of $a_1, \ldots, a_k$, the identity (5.4) holds for y.


Orthonormal basis. If the n-vectors $a_1, \ldots, a_n$ are orthonormal, they are linearly
independent, and therefore also a basis. In this case they are called an orthonormal
basis. The three examples above (on page 95) are orthonormal bases.

If $a_1, \ldots, a_n$ is an orthonormal basis, then we have, for any n-vector x, the
identity

$$x = (a_1^T x) a_1 + \cdots + (a_n^T x) a_n. \qquad (5.5)$$

To see this, we note that since $a_1, \ldots, a_n$ are a basis, x can be expressed as a
linear combination of them; hence the identity (5.4) above holds. The equation
above is sometimes called the orthonormal expansion formula; the right-hand side
is called the expansion of x in the basis $a_1, \ldots, a_n$. It shows that any n-vector can
be expressed as a linear combination of the basis elements, with the coefficients
given by taking the inner product of x with the elements of the basis.

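As an example, we express the 3-vector x = (1, 2, 3) as a linear combination of
the orthonormal basis given in (5.3). The inner products of x with these vectors
are

$$\begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix}^T x = -3, \qquad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}^T x = \frac{3}{\sqrt{2}}, \qquad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}^T x = \frac{-1}{\sqrt{2}}.$$

It can be verified that the expansion of x in this basis is

$$x = (-3) \begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix} + \frac{3}{\sqrt{2}} \left( \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} \right) + \frac{-1}{\sqrt{2}} \left( \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} \right).$$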
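
The orthonormal expansion formula translates almost directly into code. The following minimal sketch, in plain Python, computes the coefficients of x = (1, 2, 3) in the orthonormal basis (0, 0, -1), (1/√2)(1, 1, 0), (1/√2)(1, -1, 0) and rebuilds x from them; the helper names `inner` and `orthonormal_expansion` are assumptions made for this illustration.

```python
import math


def inner(u, v):
    """Inner product u^T v of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))


def orthonormal_expansion(x, basis):
    """Coefficients a_i^T x of x in an orthonormal basis a_1, ..., a_n."""
    return [inner(a, x) for a in basis]


s = 1 / math.sqrt(2)
basis = [(0.0, 0.0, -1.0), (s, s, 0.0), (s, -s, 0.0)]
x = (1.0, 2.0, 3.0)

coeffs = orthonormal_expansion(x, basis)
# Right-hand side of the expansion formula: sum_i (a_i^T x) a_i.
rebuilt = [sum(c * a[j] for c, a in zip(coeffs, basis)) for j in range(len(x))]
print(coeffs, rebuilt)
```

Note that this shortcut is only valid because the basis is orthonormal; for a general basis the coefficients are not given by inner products.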


Gram-Schmidt algorithm 


In this section we describe an algorithm that can be used to determine if a list of
n-vectors $a_1, \ldots, a_k$ is linearly independent. In later chapters we will see that it
has many other uses as well. The algorithm is named after the mathematicians
Jørgen Pedersen Gram and Erhard Schmidt, although it was already known before
their work.

If the vectors are linearly independent, the Gram-Schmidt algorithm produces
an orthonormal collection of vectors $q_1, \ldots, q_k$ with the following properties: For
each $i = 1, \ldots, k$, $a_i$ is a linear combination of $q_1, \ldots, q_i$, and $q_i$ is a linear combination
of $a_1, \ldots, a_i$. If the vectors $a_1, \ldots, a_{j-1}$ are linearly independent, but
$a_1, \ldots, a_j$ are linearly dependent, the algorithm detects this and terminates. In
other words, the Gram-Schmidt algorithm finds the first vector $a_j$ that is a linear
combination of previous vectors $a_1, \ldots, a_{j-1}$.


Algorithm 5.1 GRAM-SCHMIDT ALGORITHM

given n-vectors $a_1, \ldots, a_k$
for $i = 1, \ldots, k$,
1. Orthogonalization. $\tilde{q}_i = a_i - (q_1^T a_i) q_1 - \cdots - (q_{i-1}^T a_i) q_{i-1}$
2. Test for linear dependence. if $\tilde{q}_i = 0$, quit.
3. Normalization. $q_i = \tilde{q}_i / \|\tilde{q}_i\|$


The orthogonalization step, with i = 1, reduces to $\tilde{q}_1 = a_1$. If the algorithm does
not quit (in step 2), i.e., $\tilde{q}_1, \ldots, \tilde{q}_k$ are all nonzero, we can conclude that the original
collection of vectors is linearly independent; if the algorithm does quit early, say,
with $\tilde{q}_j = 0$, we can conclude that the original collection of vectors is linearly
dependent (and indeed, that $a_j$ is a linear combination of $a_1, \ldots, a_{j-1}$).

Figure 5.3 illustrates the Gram-Schmidt algorithm for two 2-vectors. The top
row shows the original vectors; the middle and bottom rows show the first and
second iterations of the loop in the Gram-Schmidt algorithm, with the left-hand
side showing the orthogonalization step, and the right-hand side showing the normalization
step.
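
The three steps of algorithm 5.1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function name `gram_schmidt`, the return convention, and the tolerance `tol` used in place of an exact test for zero are all assumptions (in floating-point arithmetic, a vector that should be zero is usually only very small).

```python
import math


def inner(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_schmidt(a, tol=1e-10):
    """Run the Gram-Schmidt algorithm on the list of vectors a.

    Returns (q, None) if every vector is processed, or (q, i) if the
    algorithm quits at 0-based index i because q-tilde is numerically zero.
    """
    q = []
    for i, ai in enumerate(a):
        # Step 1: orthogonalization, subtracting (q_j^T a_i) q_j for each j < i.
        q_tilde = [float(x) for x in ai]
        for qj in q:
            coef = inner(qj, ai)
            q_tilde = [x - coef * y for x, y in zip(q_tilde, qj)]
        # Step 2: test for linear dependence (tolerance is an assumption).
        norm = math.sqrt(inner(q_tilde, q_tilde))
        if norm <= tol:
            return q, i
        # Step 3: normalization.
        q.append([x / norm for x in q_tilde])
    return q, None


q, quit_index = gram_schmidt([(-1, 1, -1, 1), (-1, 3, -1, 3), (1, 3, 5, 7)])
print(q, quit_index)

# A dependent list: the second vector is a multiple of the first.
_, quit_index_dep = gram_schmidt([(1, 0), (2, 0)])
print(quit_index_dep)
```

When `quit_index` is `None` the input list is linearly independent; otherwise it gives the position of the first vector that is a linear combination of the previous ones.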
Figure 5.3 Gram-Schmidt algorithm applied to two 2-vectors $a_1$, $a_2$. Top.
The original vectors $a_1$ and $a_2$. The gray circle shows the points with norm
one. Middle left. The orthogonalization step in the first iteration yields
$\tilde{q}_1 = a_1$. Middle right. The normalization step in the first iteration scales
$\tilde{q}_1$ to have norm one, which yields $q_1$. Bottom left. The orthogonalization
step in the second iteration subtracts a multiple of $q_1$ to yield the vector
$\tilde{q}_2$, which is orthogonal to $q_1$. Bottom right. The normalization step in the
second iteration scales $\tilde{q}_2$ to have norm one, which yields $q_2$.
Analysis of Gram-Schmidt algorithm. Let us show that the following hold, for
$i = 1, \ldots, k$, assuming $a_1, \ldots, a_k$ are linearly independent.

1. $\tilde{q}_i \neq 0$, so the linear dependence test in step 2 is not satisfied, and we do not
have a divide-by-zero error in step 3.

2. $q_1, \ldots, q_i$ are orthonormal.

3. $a_i$ is a linear combination of $q_1, \ldots, q_i$.

4. $q_i$ is a linear combination of $a_1, \ldots, a_i$.

We show this by induction. For i = 1, we have $\tilde{q}_1 = a_1$. Since $a_1, \ldots, a_k$ are
linearly independent, we must have $a_1 \neq 0$, and therefore $\tilde{q}_1 \neq 0$, so assertion 1
holds. The single vector $q_1$ (considered as a list with one element) is evidently
orthonormal, since $\|q_1\| = 1$, so assertion 2 holds. We have $a_1 = \|\tilde{q}_1\| q_1$, and
$q_1 = (1/\|\tilde{q}_1\|) a_1$, so assertions 3 and 4 hold.

Suppose our assertion holds for some i - 1, with $i \leq k$; we will show it holds for i.
If $\tilde{q}_i = 0$, then $a_i$ is a linear combination of $q_1, \ldots, q_{i-1}$ (from the first step in the
algorithm); but each of these is (by the induction hypothesis) a linear combination
of $a_1, \ldots, a_{i-1}$, so it follows that $a_i$ is a linear combination of $a_1, \ldots, a_{i-1}$, which
contradicts our assumption that $a_1, \ldots, a_k$ are linearly independent. So assertion 1
holds for i.

Step 3 of the algorithm ensures that $q_1, \ldots, q_i$ are normalized; to show they are
orthogonal we will show that $q_i \perp q_j$ for $j = 1, \ldots, i-1$. (Our induction hypothesis
tells us that $q_r \perp q_s$ for $r, s < i$ with $r \neq s$.) For any $j = 1, \ldots, i-1$, we have (using step 1)

$$q_j^T \tilde{q}_i = q_j^T a_i - (q_1^T a_i)(q_j^T q_1) - \cdots - (q_{i-1}^T a_i)(q_j^T q_{i-1}) = q_j^T a_i - q_j^T a_i = 0,$$

using $q_j^T q_k = 0$ for $j \neq k$ and $q_j^T q_j = 1$. (This explains why step 1 is called the
orthogonalization step: We subtract from $a_i$ a linear combination of $q_1, \ldots, q_{i-1}$
that ensures $\tilde{q}_i \perp q_j$ for $j < i$.) Since $q_i = (1/\|\tilde{q}_i\|) \tilde{q}_i$, we have $q_i^T q_j = 0$ for
$j = 1, \ldots, i-1$. So assertion 2 holds for i.

It is immediate that $a_i$ is a linear combination of $q_1, \ldots, q_i$:

$$a_i = \tilde{q}_i + (q_1^T a_i) q_1 + \cdots + (q_{i-1}^T a_i) q_{i-1} = (q_1^T a_i) q_1 + \cdots + (q_{i-1}^T a_i) q_{i-1} + \|\tilde{q}_i\| q_i.$$

From step 1 of the algorithm, we see that $\tilde{q}_i$ is a linear combination of the vectors
$a_i, q_1, \ldots, q_{i-1}$. By the induction hypothesis, each of $q_1, \ldots, q_{i-1}$ is a linear
combination of $a_1, \ldots, a_{i-1}$, so $\tilde{q}_i$ (and therefore also $q_i$) is a linear combination of
$a_1, \ldots, a_i$. Thus assertions 3 and 4 hold.


Gram-Schmidt completion implies linear independence. From the properties
1-4 above, we can argue that the original collection of vectors $a_1, \ldots, a_k$ is linearly
independent. To see this, suppose that

$$\beta_1 a_1 + \cdots + \beta_k a_k = 0 \qquad (5.6)$$

holds for some $\beta_1, \ldots, \beta_k$. We will show that $\beta_1 = \cdots = \beta_k = 0$.

We first note that any linear combination of $q_1, \ldots, q_{k-1}$ is orthogonal to any
multiple of $q_k$, since $q_1^T q_k = \cdots = q_{k-1}^T q_k = 0$ (by definition). But each of
$a_1, \ldots, a_{k-1}$ is a linear combination of $q_1, \ldots, q_{k-1}$, so we have $q_k^T a_1 = \cdots =
q_k^T a_{k-1} = 0$. Taking the inner product of $q_k$ with the left- and right-hand sides
of (5.6) we obtain

$$0 = q_k^T (\beta_1 a_1 + \cdots + \beta_k a_k) = \beta_1 q_k^T a_1 + \cdots + \beta_{k-1} q_k^T a_{k-1} + \beta_k q_k^T a_k = \beta_k \|\tilde{q}_k\|,$$

where we use $q_k^T a_k = \|\tilde{q}_k\|$ in the last step. We conclude that $\beta_k = 0$.

From (5.6) and $\beta_k = 0$ we have

$$\beta_1 a_1 + \cdots + \beta_{k-1} a_{k-1} = 0.$$

We now repeat the argument above to conclude that $\beta_{k-1} = 0$. Repeating it k
times we conclude that all $\beta_i$ are zero.


Early termination. Suppose that the Gram-Schmidt algorithm terminates prematurely,
in iteration j, because $\tilde{q}_j = 0$. The conclusions 1-4 above hold for
$i = 1, \ldots, j-1$, since in those steps $\tilde{q}_i$ is nonzero. Since $\tilde{q}_j = 0$, we have

$$a_j = (q_1^T a_j) q_1 + \cdots + (q_{j-1}^T a_j) q_{j-1},$$

which shows that $a_j$ is a linear combination of $q_1, \ldots, q_{j-1}$. But each of these
vectors is in turn a linear combination of $a_1, \ldots, a_{j-1}$, by conclusion 4 above. Then
$a_j$ is a linear combination of $a_1, \ldots, a_{j-1}$, since it is a linear combination of linear
combinations of them (see exercise 1.18). This means that $a_1, \ldots, a_j$ are linearly
dependent, which implies that the larger set $a_1, \ldots, a_k$ is linearly dependent.

In summary, the Gram-Schmidt algorithm gives us an explicit method for determining
if a list of vectors is linearly dependent or independent.


Example. We define three vectors

$$a_1 = (-1, 1, -1, 1), \qquad a_2 = (-1, 3, -1, 3), \qquad a_3 = (1, 3, 5, 7).$$

Applying the Gram-Schmidt algorithm gives the following results.

• i = 1. We have $\|\tilde{q}_1\| = 2$, so

$$q_1 = \frac{1}{\|\tilde{q}_1\|} \tilde{q}_1 = (-1/2, 1/2, -1/2, 1/2),$$

which is simply $a_1$ normalized.

• i = 2. We have $q_1^T a_2 = 4$, so

$$\tilde{q}_2 = a_2 - (q_1^T a_2) q_1 = \begin{bmatrix} -1 \\ 3 \\ -1 \\ 3 \end{bmatrix} - 4 \begin{bmatrix} -1/2 \\ 1/2 \\ -1/2 \\ 1/2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},$$

which is indeed orthogonal to $q_1$ (and $a_1$). It has norm $\|\tilde{q}_2\| = 2$; normalizing
it gives

$$q_2 = \frac{1}{\|\tilde{q}_2\|} \tilde{q}_2 = (1/2, 1/2, 1/2, 1/2).$$

• i = 3. We have $q_1^T a_3 = 2$ and $q_2^T a_3 = 8$, so

$$\tilde{q}_3 = a_3 - (q_1^T a_3) q_1 - (q_2^T a_3) q_2 = \begin{bmatrix} 1 \\ 3 \\ 5 \\ 7 \end{bmatrix} - 2 \begin{bmatrix} -1/2 \\ 1/2 \\ -1/2 \\ 1/2 \end{bmatrix} - 8 \begin{bmatrix} 1/2 \\ 1/2 \\ 1/2 \\ 1/2 \end{bmatrix} = \begin{bmatrix} -2 \\ -2 \\ 2 \\ 2 \end{bmatrix},$$

which is orthogonal to $q_1$ and $q_2$ (and $a_1$ and $a_2$). We have $\|\tilde{q}_3\| = 4$, so the
normalized vector is

$$q_3 = \frac{1}{\|\tilde{q}_3\|} \tilde{q}_3 = (-1/2, -1/2, 1/2, 1/2).$$

Completion of the Gram-Schmidt algorithm without early termination tells us that
the vectors $a_1, a_2, a_3$ are linearly independent.
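
The results of this example are easy to confirm by direct computation. The short sketch below (helper name `inner` is an assumption) checks that $q_1 = (-1/2, 1/2, -1/2, 1/2)$, $q_2 = (1/2, 1/2, 1/2, 1/2)$, $q_3 = (-1/2, -1/2, 1/2, 1/2)$ are orthonormal, and that $a_3 = (1, 3, 5, 7)$ equals $2 q_1 + 8 q_2 + 4 q_3$, where the coefficients are the two inner products and the norm found in iteration 3.

```python
def inner(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


q1 = (-0.5, 0.5, -0.5, 0.5)
q2 = (0.5, 0.5, 0.5, 0.5)
q3 = (-0.5, -0.5, 0.5, 0.5)
a3 = (1.0, 3.0, 5.0, 7.0)

# Gram matrix of the q's: should be the identity for an orthonormal list.
qs = [q1, q2, q3]
gram = [[inner(u, v) for v in qs] for u in qs]

# a_3 = (q_1^T a_3) q_1 + (q_2^T a_3) q_2 + ||q-tilde_3|| q_3.
rebuilt_a3 = [2 * x + 8 * y + 4 * z for x, y, z in zip(q1, q2, q3)]
print(gram, rebuilt_a3)
```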


Determining if a vector is a linear combination of linearly independent vectors.
Suppose the vectors $a_1, \ldots, a_k$ are linearly independent, and we wish to determine
if another vector b is a linear combination of them. (We have already noted on
page 91 that if it is a linear combination of them, the coefficients are unique.)
The Gram-Schmidt algorithm provides an explicit way to do this. We apply the
Gram-Schmidt algorithm to the list of k + 1 vectors

$$a_1, \ldots, a_k, b.$$

These vectors are linearly dependent if b is a linear combination of $a_1, \ldots, a_k$;
they are linearly independent if b is not a linear combination of $a_1, \ldots, a_k$. The
Gram-Schmidt algorithm will determine which of these two cases holds. It cannot
terminate in the first k steps, since we assume that $a_1, \ldots, a_k$ are linearly independent.
It will terminate in the (k + 1)st step with $\tilde{q}_{k+1} = 0$ if b is a linear combination
of $a_1, \ldots, a_k$. It will not terminate in the (k + 1)st step (i.e., $\tilde{q}_{k+1} \neq 0$), otherwise.


Checking if a collection of vectors is a basis. To check if the n-vectors a1,..., an 
are a basis, we run the Gram-Schmidt algorithm on them. If Gram-Schmidt 
terminates early, they are not a basis; if it runs to completion, we know they are a 
basis. 
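
Both tests described above reduce to counting how many iterations of the Gram-Schmidt algorithm complete. The following is a minimal sketch under stated assumptions: the function names (`gram_schmidt_steps`, `is_basis`, `in_span`) and the tolerance `tol` are illustrative, and `in_span` relies on the caller supplying a linearly independent list, as in the text.

```python
import math


def gram_schmidt_steps(vectors, tol=1e-10):
    """Number of Gram-Schmidt iterations completed before the algorithm quits."""
    q = []
    for a in vectors:
        q_tilde = [float(x) for x in a]
        for qj in q:
            coef = sum(s * t for s, t in zip(qj, a))
            q_tilde = [x - coef * y for x, y in zip(q_tilde, qj)]
        norm = math.sqrt(sum(x * x for x in q_tilde))
        if norm <= tol:  # q-tilde is (numerically) zero: quit early
            break
        q.append([x / norm for x in q_tilde])
    return len(q)


def is_basis(vectors):
    """True if the n-vectors form a basis: n of them, and no early termination."""
    n = len(vectors[0])
    return len(vectors) == n and gram_schmidt_steps(vectors) == n


def in_span(b, vectors):
    """Assumes `vectors` is linearly independent; True if b is a combination."""
    k = len(vectors)
    # Termination in step k + 1 means exactly k iterations completed.
    return gram_schmidt_steps(list(vectors) + [b]) == k


print(is_basis([(1.2, -2.6), (-0.3, -3.7)]))        # two independent 2-vectors
print(in_span((2, 3, 0), [(1, 0, 0), (0, 1, 0)]))   # b lies in the plane
print(in_span((0, 0, 1), [(1, 0, 0), (0, 1, 0)]))   # b does not
```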



Complexity of the Gram-Schmidt algorithm. We now derive an operation count
for the Gram-Schmidt algorithm. In the first step of iteration i of the algorithm,
i - 1 inner products

$$q_1^T a_i, \ \ldots, \ q_{i-1}^T a_i$$

between vectors of length n are computed. This takes (i - 1)(2n - 1) flops. We
then use these inner products as the coefficients in i - 1 scalar multiplications with
the vectors $q_1, \ldots, q_{i-1}$. This requires n(i - 1) flops. We then subtract the i - 1
resulting vectors from $a_i$, which requires another n(i - 1) flops. The total flop count
for step 1 is

$$(i-1)(2n-1) + n(i-1) + n(i-1) = (4n-1)(i-1)$$

flops. In step 3 we compute the norm of $\tilde{q}_i$, which takes approximately 2n flops.
We then divide $\tilde{q}_i$ by its norm, which requires n scalar divisions. So the total flop
count for the ith iteration is (4n - 1)(i - 1) + 3n flops.

The total flop count for all k iterations of the algorithm is obtained by summing
our counts for $i = 1, \ldots, k$:

$$\sum_{i=1}^{k} \left( (4n-1)(i-1) + 3n \right) = (4n-1) \frac{k(k-1)}{2} + 3nk \approx 2nk^2,$$

where we use the fact that

$$\sum_{i=1}^{k} (i-1) = 1 + 2 + \cdots + (k-2) + (k-1) = \frac{k(k-1)}{2}, \qquad (5.7)$$

which we justify below. The complexity of the Gram-Schmidt algorithm is $2nk^2$;
its order is $nk^2$. We can guess that its running time grows linearly with the lengths
of the vectors n, and quadratically with the number of vectors k.

In the special case of k = n, the complexity of the Gram-Schmidt method is
$2n^3$. For example, if the Gram-Schmidt algorithm is used to determine whether a
collection of 1000 1000-vectors is linearly independent (and therefore a basis), the
computational cost is around $2 \times 10^9$ flops. On a modern computer, we can
expect this to take on the order of one second.
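
To see how good the approximation $2nk^2$ is, one can sum the per-iteration count directly. The sketch below does this for n = k = 1000; the function name `gram_schmidt_flops` is an assumption made for this illustration.

```python
def gram_schmidt_flops(n, k):
    """Sum the per-iteration count (4n - 1)(i - 1) + 3n over i = 1, ..., k."""
    return sum((4 * n - 1) * (i - 1) + 3 * n for i in range(1, k + 1))


n = k = 1000
exact = gram_schmidt_flops(n, k)
closed_form = (4 * n - 1) * k * (k - 1) // 2 + 3 * n * k  # uses formula (5.7)
approx = 2 * n * k ** 2

print(exact, closed_form, approx, exact / approx)
```

For these sizes the summed count and the approximation differ by well under one percent, which is why only the dominant term $2nk^2$ is kept.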

A famous anecdote alleges that the formula (5.7) was discovered by the mathe- 
matician Carl Friedrich Gauss when he was a child, although it was known before 
that time. Here is his argument, for the case when k is odd. Lump the first entry 
in the sum together with the last entry, the second entry together with the second- 
to-last entry, and so on. Each of these pairs of numbers adds up to k; since there 
are (k — 1)/2 such pairs, the total is k(k — 1)/2. A similar argument works when 
k is even. 


Modified Gram-Schmidt algorithm. When the Gram-Schmidt algorithm is im- 
plemented, a variation on it called the modified Gram-Schmidt algorithm is typ- 
ically used. This algorithm produces the same results as the Gram-Schmidt al- 
gorithm (5.1), but is less sensitive to the small round-off errors that occur when 
arithmetic calculations are done using floating point numbers. (We do not consider 
round-off error in floating-point computations in this book.) 
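
The text does not spell out the modified algorithm, so the following sketch rests on an assumption: it uses the commonly taught formulation, in which each projection coefficient is computed from the partially orthogonalized vector rather than from the original $a_i$. The function name `modified_gram_schmidt`, the return convention, and the tolerance `tol` are likewise illustrative.

```python
import math


def modified_gram_schmidt(a, tol=1e-10):
    """Return (q, completed): orthonormal vectors and whether all steps ran."""
    q = []
    for ai in a:
        q_tilde = [float(x) for x in ai]
        for qj in q:
            # Difference from algorithm 5.1: the coefficient is taken against
            # the current q_tilde, not against the original vector a_i.
            coef = sum(s * t for s, t in zip(qj, q_tilde))
            q_tilde = [x - coef * y for x, y in zip(q_tilde, qj)]
        norm = math.sqrt(sum(x * x for x in q_tilde))
        if norm <= tol:
            return q, False
        q.append([x / norm for x in q_tilde])
    return q, True


q_mgs, completed = modified_gram_schmidt(
    [(-1, 1, -1, 1), (-1, 3, -1, 3), (1, 3, 5, 7)]
)
print(q_mgs, completed)
```

In exact arithmetic both versions give the same vectors, since the terms already subtracted are orthogonal to $q_j$; the two differ only in how round-off errors accumulate.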
5.1 Linear independence of stacked vectors. Consider the stacked vectors

$$c_1 = \begin{bmatrix} a_1 \\ b_1 \end{bmatrix}, \ \ldots, \ c_k = \begin{bmatrix} a_k \\ b_k \end{bmatrix},$$

where $a_1, \ldots, a_k$ are n-vectors and $b_1, \ldots, b_k$ are m-vectors.

(a) Suppose $a_1, \ldots, a_k$ are linearly independent. (We make no assumptions about the
vectors $b_1, \ldots, b_k$.) Can we conclude that the stacked vectors $c_1, \ldots, c_k$ are linearly
independent?

(b) Now suppose that $a_1, \ldots, a_k$ are linearly dependent. (Again, with no assumptions
about $b_1, \ldots, b_k$.) Can we conclude that the stacked vectors $c_1, \ldots, c_k$ are linearly
dependent?

5.2 A surprising discovery. An intern at a quantitative hedge fund examines the daily returns
of around 400 stocks over one year (which has 250 trading days). She tells her supervisor
that she has discovered that the returns of one of the stocks, Google (GOOG), can be
expressed as a linear combination of the others, which include many stocks that are
unrelated to Google (say, in a different type of business or sector).

Her supervisor then says: “It is overwhelmingly unlikely that a linear combination of the
returns of unrelated companies can reproduce the daily return of GOOG. So you’ve made
a mistake in your calculations.”

Is the supervisor right? Did the intern make a mistake? Give a very brief explanation.


5.3 Replicating a cash flow with single-period loans. We continue the example described
on page 93. Let c be any n-vector representing a cash flow over n periods. Find the
coefficients $\alpha_1, \ldots, \alpha_n$ of c in its expansion in the basis $e_1, l_1, \ldots, l_{n-1}$, i.e.,

$$c = \alpha_1 e_1 + \alpha_2 l_1 + \cdots + \alpha_n l_{n-1}.$$

Verify that $\alpha_1$ is the net present value (NPV) of the cash flow c, defined on page 22,
with interest rate r. Hint. Use the same type of argument that was used to show that
$e_1, l_1, \ldots, l_{n-1}$ are linearly independent. Your method will find $\alpha_n$ first, then $\alpha_{n-1}$, and
so on.


5.4 Norm of linear combination of orthonormal vectors. Suppose $a_1, \ldots, a_k$ are orthonormal
n-vectors, and $x = \beta_1 a_1 + \cdots + \beta_k a_k$, where $\beta_1, \ldots, \beta_k$ are scalars. Express $\|x\|$ in terms
of $\beta = (\beta_1, \ldots, \beta_k)$.


5.5 Orthogonalizing vectors. Suppose that a and b are any n-vectors. Show that we can always
find a scalar $\gamma$ so that $(a - \gamma b) \perp b$, and that $\gamma$ is unique if $b \neq 0$. (Give a formula for
the scalar $\gamma$.) In other words, we can always subtract a multiple of a vector from another
one, so that the result is orthogonal to the original vector. The orthogonalization step in
the Gram-Schmidt algorithm is an application of this.


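5.6 Gram-Schmidt algorithm. Consider the list of n n-vectors

$$a_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad a_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad a_n = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}.$$

(The vector $a_i$ has its first i entries equal to one, and the remaining entries zero.) Describe
what happens when you run the Gram-Schmidt algorithm on this list of vectors, i.e., say
what $q_1, \ldots, q_n$ are. Is $a_1, \ldots, a_n$ a basis?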
5.7 Running Gram-Schmidt algorithm twice. We run the Gram-Schmidt algorithm once on
a given set of vectors $a_1, \ldots, a_k$ (we assume this is successful), which gives the vectors
$q_1, \ldots, q_k$. Then we run the Gram-Schmidt algorithm on the vectors $q_1, \ldots, q_k$, which
produces the vectors $z_1, \ldots, z_k$. What can you say about $z_1, \ldots, z_k$?

5.8 Early termination of Gram-Schmidt algorithm. When the Gram-Schmidt algorithm is
run on a particular list of 10 15-vectors, it terminates in iteration 5 (since $\tilde{q}_5 = 0$). Which
of the following must be true?

(a) $a_2, a_3, a_4$ are linearly independent.

(b) $a_1, a_2, a_5$ are linearly dependent.

(c) $a_1, a_2, a_3, a_4, a_5$ are linearly dependent.

(d) $a_4$ is nonzero.

5.9 A particular computer can carry out the Gram-Schmidt algorithm on a list of k = 1000
n-vectors, with n = 10000, in around 2 seconds. About how long would you expect it to
take to carry out the Gram-Schmidt algorithm with $\tilde{k} = 500$ $\tilde{n}$-vectors, with $\tilde{n} = 1000$?


Part II 


Matrices 


Chapter 6


Matrices


In this chapter we introduce matrices and some basic operations on them. We give
some applications in which they arise.


6.1 Matrices


A matrix is a rectangular array of numbers written between rectangular brackets, 
as in 

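\[ \begin{bmatrix} 0 & 1 & -2.3 & 0.1 \\ 1.3 & 4 & -0.1 & 0 \\ 4.1 & -1 & 0 & 1.7 \end{bmatrix}. \]

It is also common to use large parentheses instead of rectangular brackets, as in

\[ \begin{pmatrix} 0 & 1 & -2.3 & 0.1 \\ 1.3 & 4 & -0.1 & 0 \\ 4.1 & -1 & 0 & 1.7 \end{pmatrix}. \]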


An important attribute of a matrix is its size or dimensions, i.e., the numbers of 
rows and columns. The matrix above has 3 rows and 4 columns, so its size is 3 x 4. 
A matrix of size m x n is called an m x n matrix. 

The elements (or entries or coefficients) of a matrix are the values in the array.
The i, j element is the value in the ith row and jth column, denoted by double
subscripts: the i, j element of a matrix A is denoted $A_{ij}$ (or $A_{i,j}$, when i or j is
more than one digit or character). The positive integers i and j are called the (row
and column) indices. If A is an $m \times n$ matrix, then the row index i runs from 1 to
m and the column index j runs from 1 to n. Row indices go from top to bottom,
so row 1 is the top row and row m is the bottom row. Column indices go from left
to right, so column 1 is the left column and column n is the right column.

If the matrix above is B, then we have $B_{13} = -2.3$, $B_{32} = -1$. The row index
of the bottom left element (which has value 4.1) is 3; its column index is 1.

Two matrices are equal if they have the same size, and the corresponding entries
are all equal. As with vectors, we normally deal with matrices with entries that
are real numbers, which will be our assumption unless we state otherwise. The set
of real $m \times n$ matrices is denoted $\mathbf{R}^{m \times n}$. But matrices with complex entries, for
example, do arise in some applications.


Matrix indexing. As with vectors, standard mathematical notation indexes the 
rows and columns of a matrix starting from 1. In computer languages, matrices 
are often (but not always) stored as 2-dimensional arrays, which can be indexed in 
a variety of ways, depending on the language. Lower level languages typically use 
indices starting from 0; higher level languages and packages that support matrix 
operations usually use standard mathematical indexing, starting from 1. 
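
As a minimal sketch of the indexing issue (not part of the original text), the Python fragment below stores the 3 x 4 example matrix B as a list of rows. Python lists are indexed from 0, so the hypothetical helper `entry` subtracts one from each mathematical index; the storage layout and the names `entry` and `size` are assumptions made here for illustration.

```python
# The 3 x 4 example matrix, stored as a list of rows (an assumed representation).
B = [
    [0.0, 1.0, -2.3, 0.1],
    [1.3, 4.0, -0.1, 0.0],
    [4.1, -1.0, 0.0, 1.7],
]


def entry(M, i, j):
    """Return the i,j entry of M, with i and j counted from 1 as in the text."""
    return M[i - 1][j - 1]  # Python itself counts from 0


def size(M):
    """Return (number of rows, number of columns)."""
    return len(M), len(M[0])


print(size(B))         # (3, 4)
print(entry(B, 1, 3))  # -2.3
print(entry(B, 3, 2))  # -1.0
```

Keeping the index shift inside one small function is one way to avoid off-by-one mistakes when translating formulas into code.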


Square, tall, and wide matrices. A square matrix has an equal number of rows 
and columns. A square matrix of size n x n is said to be of order n. A tall matrix 
has more rows than columns (size m x n with m > n). A wide matrix has more 
columns than rows (size m x n with n > m). 


Column and row vectors. An n-vector can be interpreted as an n x 1 matrix; we 
do not distinguish between vectors and matrices with one column. A matrix with 
only one row, i.e., with size 1 x n, is called a row vector; to give its size, we can 
refer to it as an n-row-vector. As an example, 


\[ \begin{bmatrix} -2.1 & -3 & 0 \end{bmatrix} \]


is a 3-row-vector (or 1 x 3 matrix). To distinguish them from row vectors, vectors 
are sometimes called column vectors. A 1 x 1 matrix is considered to be the same 
as a scalar. 


Notational conventions. Many authors (including us) tend to use capital letters 
to denote matrices, and lower case letters for (column or row) vectors. But this 
convention is not standardized, so you should be prepared to figure out whether 
a symbol represents a matrix, column vector, row vector, or a scalar, from context.
(The more considerate authors will tell you what the symbols represent, for
example, by referring to ‘the matrix A’ when introducing it.) 


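Columns and rows of a matrix. An $m \times n$ matrix A has n columns, given by (the
m-vectors)

\[ a_j = \begin{bmatrix} A_{1j} \\ \vdots \\ A_{mj} \end{bmatrix}, \]

for $j = 1, \ldots, n$. The same matrix has m rows, given by the (n-row-vectors)

\[ b_i = \begin{bmatrix} A_{i1} & \cdots & A_{in} \end{bmatrix}, \]

for $i = 1, \ldots, m$.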


As a specific example, the $2 \times 3$ matrix

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]

has first row

\[ \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \]

(which is a 3-row-vector or a $1 \times 3$ matrix), and second column

\[ \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

(which is a 2-vector or $2 \times 1$ matrix), also written compactly as (2,5).
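
The row and column extraction described above can be sketched in a few lines of Python. This is an illustration under the assumption that a matrix is stored as a list of rows; the helper names `row` and `column` are hypothetical.

```python
A = [[1, 2, 3],
     [4, 5, 6]]


def row(M, i):
    """Row i of M (counted from 1), returned as a list of n numbers."""
    return list(M[i - 1])


def column(M, j):
    """Column j of M (counted from 1), returned as a list of m numbers."""
    return [r[j - 1] for r in M]


print(row(A, 1))     # [1, 2, 3]
print(column(A, 2))  # [2, 5]
```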


Block matrices and submatrices. It is useful to consider matrices whose entries 
are themselves matrices, as in 


\[ A = \begin{bmatrix} B & C \\ D & E \end{bmatrix}, \]


where B, C, D, and E are matrices. Such matrices are called block matrices; the 
elements B, C, D, and E are called blocks or submatrices of A. The submatrices 
can be referred to by their block row and column indices; for example, C is the 1,2 
block of A. 

Block matrices must have the right dimensions to fit together. Matrices in the 
same (block) row must have the same number of rows (i.e., the same ‘height’); 
matrices in the same (block) column must have the same number of columns (i.e., 
the same ‘width’). In the example above, B and C must have the same number of 
rows, and C and E must have the same number of columns. Matrix blocks placed 
next to each other in the same row are said to be concatenated; matrix blocks 
placed above each other are called stacked. 

As an example, consider 


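\[ B = \begin{bmatrix} 0 & 2 & 3 \end{bmatrix}, \quad C = \begin{bmatrix} -1 \end{bmatrix}, \quad D = \begin{bmatrix} 2 & 2 & 1 \\ 1 & 3 & 5 \end{bmatrix}, \quad E = \begin{bmatrix} 4 \\ 4 \end{bmatrix}. \]

Then the block matrix A above is given by

\[ A = \begin{bmatrix} 0 & 2 & 3 & -1 \\ 2 & 2 & 1 & 4 \\ 1 & 3 & 5 & 4 \end{bmatrix}. \tag{6.1} \]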

(Note that we have dropped the left and right brackets that delimit the blocks. 
This is similar to the way we drop the brackets in a 1 x 1 matrix to get a scalar.) 

We can also divide a larger matrix (or vector) into ‘blocks’. In this context the
blocks are called submatrices of the big matrix. As with vectors, we can use colon
notation to denote submatrices. If A is an $m \times n$ matrix, and p, q, r, s are integers
with $1 \le p \le q \le m$ and $1 \le r \le s \le n$, then $A_{p:q,r:s}$ denotes the submatrix

\[ A_{p:q,r:s} = \begin{bmatrix} A_{pr} & A_{p,r+1} & \cdots & A_{ps} \\ A_{p+1,r} & A_{p+1,r+1} & \cdots & A_{p+1,s} \\ \vdots & \vdots & & \vdots \\ A_{qr} & A_{q,r+1} & \cdots & A_{qs} \end{bmatrix}. \]

This submatrix has size $(q - p + 1) \times (s - r + 1)$ and is obtained by extracting from
A the elements in rows p through q and columns r through s.
For the specific matrix A in (6.1), we have

\[ A_{2:3,3:4} = \begin{bmatrix} 1 & 4 \\ 5 & 4 \end{bmatrix}. \]
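
To make concatenation, stacking, and colon notation concrete, here is a small Python sketch. It assumes matrices stored as lists of rows; the helper names `hcat`, `vcat`, and `submatrix` are hypothetical, and the blocks are chosen so that a 1 x 3 block and a 1 x 1 block sit above a 2 x 3 block and a 2 x 1 block.

```python
def hcat(*blocks):
    """Concatenate blocks side by side; they must all have the same height."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks in a block row need the same number of rows")
    return [sum((b[i] for b in blocks), []) for i in range(len(blocks[0]))]


def vcat(*blocks):
    """Stack blocks on top of each other; they must all have the same width."""
    if len({len(b[0]) for b in blocks}) != 1:
        raise ValueError("blocks in a block column need the same number of columns")
    return [list(r) for b in blocks for r in b]


def submatrix(M, p, q, r, s):
    """Rows p..q and columns r..s of M, with 1-based inclusive ranges."""
    return [row[r - 1:s] for row in M[p - 1:q]]


B = [[0, 2, 3]]
C = [[-1]]
D = [[2, 2, 1], [1, 3, 5]]
E = [[4], [4]]

A = vcat(hcat(B, C), hcat(D, E))
print(A)                         # [[0, 2, 3, -1], [2, 2, 1, 4], [1, 3, 5, 4]]
print(submatrix(A, 2, 3, 3, 4))  # [[1, 4], [5, 4]]
```

The size checks mirror the requirement that blocks in the same block row share a height and blocks in the same block column share a width.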


Column and row representation of a matrix. Using block matrix notation we
can write an $m \times n$ matrix A as a block matrix with one block row and n block
columns,

\[ A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}, \]

where $a_j$, which is an m-vector, is the jth column of A. Thus, an $m \times n$ matrix
can be viewed as its n columns, concatenated.

Similarly, an $m \times n$ matrix A can be written as a block matrix with one block
column and m block rows:

\[ A = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}, \]

where $b_i$, which is a row n-vector, is the ith row of A. In this notation, the matrix
A is interpreted as its m rows, stacked.


Examples 


Table interpretation. The most direct interpretation of a matrix is as a table of 
numbers that depend on two indices, i and j. (A vector is a list of numbers that 
depend on only one index.) In this case the rows and columns of the matrix usually 
have some simple interpretation. Some examples are given below. 


• Images. A black and white image with $M \times N$ pixels is naturally represented
as an $M \times N$ matrix. The row index i gives the vertical position of the pixel,
the column index j gives the horizontal position of the pixel, and the i, j
entry gives the pixel value.


• Rainfall data. An $m \times n$ matrix A gives the rainfall at m different locations
on n consecutive days, so $A_{42}$ (which is a number) is the rainfall at location
4 on day 2. The jth column of A, which is an m-vector, gives the rainfall at
the m locations on day j. The ith row of A, which is an n-row-vector, is the
time series of rainfall at location i.


• Asset returns. A $T \times n$ matrix R gives the returns of a collection of n assets
(called the universe of assets) over T periods, with $R_{ij}$ giving the return of
asset j in period i. So $R_{12,7} = -0.03$ means that asset 7 had a 3% loss in
period 12. The 4th column of R is a T-vector that is the return time series
for asset 4. The 3rd row of R is an n-row-vector that gives the returns of all
assets in the universe in period 3.

An example of an asset return matrix, with a universe of n = 4 assets over
T = 3 periods, is shown in table 6.1.


Date             AAPL      GOOG      MMM       AMZN
March 1, 2016    0.00219   0.00006  -0.00113   0.00202
March 2, 2016    0.00744  -0.00894  -0.00019  -0.00468
March 3, 2016    0.01488  -0.00215   0.00433  -0.00407

Table 6.1 Daily returns of Apple (AAPL), Google (GOOG), 3M (MMM),
and Amazon (AMZN), on March 1, 2, and 3, 2016 (based on closing prices).


• Prices from multiple suppliers. An $m \times n$ matrix P gives the prices of n
different goods from m different suppliers (or locations): $P_{ij}$ is the price that
supplier i charges for good j. The jth column of P is the m-vector of supplier
prices for good j; the ith row gives the prices for all goods from supplier i.


• Contingency table. Suppose we have a collection of objects with two attributes,
the first attribute with m possible values and the second with n
possible values. An $m \times n$ matrix A can be used to hold the counts of the
numbers of objects with the different pairs of attributes: $A_{ij}$ is the number
of objects with first attribute i and second attribute j. (This is the analog
of a count n-vector, that records the counts of one attribute in a collection.)
For example, a population of college students can be described by a $4 \times 50$
matrix, with the i, j entry the number of students in year i of their studies,
from state j (with the states ordered in, say, alphabetical order). The ith row
of A gives the geographic distribution of students in year i of their studies;
the jth column of A is a 4-vector giving the numbers of students from state j
in their first through fourth years of study.


• Customer purchase history. An $n \times N$ matrix P can be used to store a set of
N customers’ purchase histories of n products, items, or services, over some
period. The entry $P_{ij}$ represents the dollar value of product i that customer j
purchased over the period (or as an alternative, the number or quantity of the
product). The jth column of P is the purchase history vector for customer j;
the ith row gives the sales report for product i across the N customers.
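
As an illustration of the table interpretation, the sketch below builds a small contingency table from a list of attribute pairs. The data are made up (six students, four years of study, three states rather than fifty), and the function name `contingency_table` is hypothetical.

```python
def contingency_table(pairs, m, n):
    """Return the m x n table whose i,j entry counts the pairs equal to (i, j).

    Attribute values are assumed to be integers 1..m and 1..n.
    """
    A = [[0] * n for _ in range(m)]
    for i, j in pairs:
        A[i - 1][j - 1] += 1
    return A


# Made-up data: (year of study, state index) for six students.
students = [(1, 1), (1, 3), (2, 3), (4, 2), (1, 3), (3, 1)]
A = contingency_table(students, 4, 3)

print(A)                  # [[1, 0, 2], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(A[0])               # row 1: first-year students by state -> [1, 0, 2]
print([r[2] for r in A])  # column 3: students from state 3 by year -> [2, 1, 0, 0]
```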


Matrix representation of a collection of vectors. Matrices are very often used
as a compact way to give a set of indexed vectors of the same size. For example,
if $x_1, \ldots, x_N$ are n-vectors that give the n feature values for each of N objects, we
can collect them all into one $n \times N$ matrix

\[ X = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}, \]

often called a data matrix or feature matrix. Its jth column is the feature n-vector
for the jth object (in this context sometimes called the jth example). The ith row
of the data matrix X is an N-row-vector whose entries are the values of the ith
feature across the examples. We can also directly interpret the entries of the data
matrix: $X_{ij}$ (which is a number) is the value of the ith feature for the jth example.

As another example, a $3 \times M$ matrix can be used to represent a collection of
M locations or positions in 3-D space, with its jth column giving the jth position.


Figure 6.1 The relation (6.2) as a directed graph.


Matrix representation of a relation or graph. Suppose we have n objects labeled
1,...,n. A relation R on the set of objects {1,...,n} is a subset of ordered pairs
of objects. As an example, R can represent a preference relation among n possible
products or choices, with $(i, j) \in R$ meaning that choice i is preferred to choice j.

A relation can also be viewed as a directed graph, with nodes (or vertices) labeled
1,...,n, and a directed edge from j to i for each $(i, j) \in R$. This is typically drawn
as a graph, with arrows indicating the direction of the edge, as shown in figure 6.1,
for the relation on 4 objects

\[ R = \{(1,2), (1,3), (2,1), (2,4), (3,4), (4,1)\}. \tag{6.2} \]

A relation R on {1,...,n} is represented by the $n \times n$ matrix A with

\[ A_{ij} = \begin{cases} 1 & (i,j) \in R \\ 0 & (i,j) \notin R. \end{cases} \]

This matrix is called the adjacency matrix associated with the graph. (Some authors
define the adjacency matrix in the reverse sense, with $A_{ij} = 1$ meaning there
is an edge from i to j.) The relation (6.2), for example, is represented by the matrix

\[ A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}. \]

This is the adjacency matrix of the associated graph, shown in figure 6.1. (We will
encounter another matrix associated with a directed graph in §7.3.)
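
The adjacency matrix can be computed mechanically from the set of ordered pairs. The following Python sketch applies the definition to the relation (6.2); the function name `adjacency_matrix` is a hypothetical choice, and the convention is the one used in the text (entry i, j is 1 when (i, j) is in R).

```python
def adjacency_matrix(relation, n):
    """Return the n x n matrix with entry i,j equal to 1 if (i, j) is in the relation."""
    A = [[0] * n for _ in range(n)]
    for i, j in relation:
        A[i - 1][j - 1] = 1
    return A


R = {(1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (4, 1)}
A = adjacency_matrix(R, 4)
for row in A:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 0, 1]
# [0, 0, 0, 1]
# [1, 0, 0, 0]
```

Under the reverse convention mentioned above, the same code would produce the transposed matrix if the assignment used `A[j - 1][i - 1]` instead.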


6.2 Zero and identity matrices


Zero matrix. A zero matrix is a matrix with all elements equal to zero. The zero
matrix of size $m \times n$ is sometimes written as $0_{m \times n}$, but usually a zero matrix is
denoted just 0, the same symbol used to denote the number 0 or zero vectors. In
this case the size of the zero matrix must be determined from the context.


Identity matrix. An identity matrix is another common matrix. It is always
square. Its diagonal elements, i.e., those with equal row and column indices, are
all equal to one, and its off-diagonal elements (those with unequal row and column
indices) are zero. Identity matrices are denoted by the letter I. Formally, the
identity matrix of size n is defined by

\[ I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j, \end{cases} \]

for $i, j = 1, \ldots, n$. For example,

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \]

are the $2 \times 2$ and $4 \times 4$ identity matrices.
The column vectors of the $n \times n$ identity matrix are the unit vectors of size n.
Using block matrix notation, we can write

\[ I = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}, \]

where $e_k$ is the kth unit vector of size n.

Sometimes a subscript is used to denote the size of an identity matrix, as in
$I_4$ or $I_{2 \times 2}$. But more often the size is omitted and follows from the context. For
example, if

\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \]

then

\[ \begin{bmatrix} I & A \\ 0 & I \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 2 & 3 \\ 0 & 1 & 4 & 5 & 6 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \]

The dimensions of the two identity matrices follow from the size of A. The identity
matrix in the 1,1 position must be $2 \times 2$, and the identity matrix in the 2,2 position
must be $3 \times 3$. This also determines the size of the zero matrix in the 2,1 position.
The importance of the identity matrix will become clear later, in §10.1.
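
The way block sizes are forced by context can be checked with a short Python sketch. It assumes a 2 x 3 matrix A chosen for illustration and builds the block matrix with identity blocks on the diagonal, A in the 1,2 position, and a zero block in the 2,1 position; the helper names `identity` and `zeros` are hypothetical.

```python
def identity(n):
    """The n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def zeros(m, n):
    """The m x n zero matrix as a list of rows."""
    return [[0] * n for _ in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]
m, n = len(A), len(A[0])

# The 1,1 block must have as many rows as A (m); the 2,2 block must have as
# many columns as A (n); the zero block is then n x m.
I_m, I_n, Z = identity(m), identity(n), zeros(n, m)
top = [I_m[i] + A[i] for i in range(m)]
bottom = [Z[i] + I_n[i] for i in range(n)]
M = top + bottom

for row in M:
    print(row)
# [1, 0, 1, 2, 3]
# [0, 1, 4, 5, 6]
# [0, 0, 1, 0, 0]
# [0, 0, 0, 1, 0]
# [0, 0, 0, 0, 1]
```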




Sparse matrices. A matrix A is said to be sparse if many of its entries are zero,
or (put another way) just a few of its entries are nonzero. Its sparsity pattern is
the set of indices (i, j) for which $A_{ij} \neq 0$. The number of nonzeros of a sparse
matrix A is the number of entries in its sparsity pattern, and denoted $\mathbf{nnz}(A)$. If
A is $m \times n$ we have $\mathbf{nnz}(A) \le mn$. Its density is $\mathbf{nnz}(A)/(mn)$, which is no more
than one. Densities of sparse matrices that arise in applications are typically small
or very small, as in $10^{-2}$ or $10^{-4}$. There is no precise definition of how small the
density must be for a matrix to qualify as sparse. A famous definition of sparse
matrix due to the mathematician James H. Wilkinson is: A matrix is sparse if it
has enough zero entries that it pays to take advantage of them. Sparse matrices
can be stored and manipulated efficiently on a computer.

Many common matrices are sparse. An n x n identity matrix is sparse, since 
it has only n nonzeros, so its density is 1/n. The zero matrix is the sparsest 
possible matrix, since it has no nonzero entries. Several special sparsity patterns 
have names; we describe some important ones below. 

Like sparse vectors, sparse matrices arise in many applications. A typical customer
purchase history matrix (see page 111) is sparse, since each customer has
likely only purchased a small fraction of all the products available. 
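
The quantities defined above (sparsity pattern, number of nonzeros, density) are simple to compute. The sketch below does so for a dense list-of-rows representation, which is an assumption made for clarity; a practical sparse format would store only the nonzero entries. The function names are hypothetical.

```python
def sparsity_pattern(A):
    """Set of 1-based index pairs (i, j) with a nonzero entry."""
    return {(i + 1, j + 1)
            for i, row in enumerate(A)
            for j, a in enumerate(row)
            if a != 0}


def nnz(A):
    """Number of nonzero entries of A."""
    return len(sparsity_pattern(A))


def density(A):
    """Fraction of entries of A that are nonzero."""
    return nnz(A) / (len(A) * len(A[0]))


I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
print(sorted(sparsity_pattern(I4)))  # [(1, 1), (2, 2), (3, 3), (4, 4)]
print(nnz(I4))                       # 4
print(density(I4))                   # 0.25
```

For the n x n identity the density is 1/n, in agreement with the text; here n = 4.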


Diagonal matrices. A square $n \times n$ matrix A is diagonal if $A_{ij} = 0$ for $i \neq j$.
(The entries of a matrix with i = j are called the diagonal entries; those with $i \neq j$
are its off-diagonal entries.) A diagonal matrix is one for which all off-diagonal
entries are zero. Examples of diagonal matrices we have already seen are square
zero matrices and identity matrices. Other examples are

\[ \begin{bmatrix} -3 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \begin{bmatrix} 0.2 & 0 & 0 \\ 0 & -3 & 0 \\ 0 & 0 & 1.2 \end{bmatrix}. \]

(Note that in the first example, one of the diagonal elements is also zero.)
The notation $\mathbf{diag}(a_1, \ldots, a_n)$ is used to compactly describe the $n \times n$ diagonal
matrix A with diagonal entries $A_{11} = a_1, \ldots, A_{nn} = a_n$. This notation is not yet
standard, but is coming into more prevalent use. As examples, the matrices above
would be expressed as

\[ \mathbf{diag}(-3, 0), \qquad \mathbf{diag}(0.2, -3, 1.2), \]

respectively. We also allow diag to take one n-vector argument, as in $I = \mathbf{diag}(\mathbf{1})$.

Triangular matrices. A square $n \times n$ matrix A is upper triangular if $A_{ij} = 0$ for
$i > j$, and it is lower triangular if $A_{ij} = 0$ for $i < j$. (So a diagonal matrix is
one that is both lower and upper triangular.) If a matrix is either lower or upper
triangular, it is called triangular. For example, the matrices

\[ \begin{bmatrix} 1 & -1 & 0.7 \\ 0 & 1.2 & -1.1 \\ 0 & 0 & 3.2 \end{bmatrix}, \qquad \begin{bmatrix} -0.6 & 0 \\ -0.3 & 3.5 \end{bmatrix} \]

are upper and lower triangular, respectively.
A triangular $n \times n$ matrix A has up to $n(n + 1)/2$ nonzero entries, i.e., around
half its entries are zero. Triangular matrices are generally not considered sparse
matrices, since their density is around 50%, but their special sparsity pattern will
be important in the sequel.
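
The definitions of diagonal and triangular matrices translate directly into index tests. The Python sketch below is illustrative only: the helper names are hypothetical, the matrices are assumed square and stored as lists of rows, and the count n(n + 1)/2 is checked for one example.

```python
def diag(*entries):
    """Diagonal matrix with the given diagonal entries."""
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]


def is_upper_triangular(A):
    """True if every entry below the diagonal (i > j) is zero."""
    n = len(A)
    return all(A[i][j] == 0 for i in range(n) for j in range(n) if i > j)


def is_lower_triangular(A):
    """True if every entry above the diagonal (i < j) is zero."""
    n = len(A)
    return all(A[i][j] == 0 for i in range(n) for j in range(n) if i < j)


def is_diagonal(A):
    """A matrix is diagonal exactly when it is both lower and upper triangular."""
    return is_upper_triangular(A) and is_lower_triangular(A)


D = diag(0.2, -3, 1.2)
U = [[1, -1, 0.7], [0, 1.2, -1.1], [0, 0, 3.2]]

print(D)                       # [[0.2, 0, 0], [0, -3, 0], [0, 0, 1.2]]
print(is_diagonal(D))          # True
print(is_upper_triangular(U))  # True
print(is_lower_triangular(U))  # False

n = len(U)
nonzeros = sum(1 for row in U for a in row if a != 0)
print(nonzeros, n * (n + 1) // 2)  # 6 6
```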


6.3 Transpose, addition, and norm 


6.3.1 Matrix transpose 


If A is an $m \times n$ matrix, its transpose, denoted $A^T$ (or sometimes $A'$ or $A^*$), is the
$n \times m$ matrix given by $(A^T)_{ij} = A_{ji}$. In words, the rows and columns of A are
transposed in $A^T$. For example,

\[ \begin{bmatrix} 0 & 4 \\ 7 & 0 \\ 3 & 1 \end{bmatrix}^T = \begin{bmatrix} 0 & 7 & 3 \\ 4 & 0 & 1 \end{bmatrix}. \]

If we transpose a matrix twice, we get back the original matrix: $(A^T)^T = A$. (The
superscript T in the transpose is the same one used to denote the inner product of
two n-vectors; we will soon see how they are related.)
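
A transpose is a one-liner in Python when a matrix is stored as a list of rows. The sketch below (with a hypothetical helper `transpose`) checks the definition on a 3 x 2 example and confirms that transposing twice returns the original matrix.

```python
def transpose(A):
    """Return the transpose: entry i,j of the result is entry j,i of A."""
    return [list(col) for col in zip(*A)]


A = [[0, 4],
     [7, 0],
     [3, 1]]

At = transpose(A)
print(At)                  # [[0, 7, 3], [4, 0, 1]]
print(transpose(At) == A)  # True
```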


Row and column vectors. Transposition converts row vectors into column vectors
and vice versa. It is sometimes convenient to express a row vector as $a^T$, where
a is a column vector. For example, we might refer to the m rows of an $m \times n$
matrix A as $\tilde a_1^T, \ldots, \tilde a_m^T$, where $\tilde a_1, \ldots, \tilde a_m$ are (column) n-vectors. As an example,
the second row of the matrix

\[ \begin{bmatrix} 0 & 7 & 3 \\ 4 & 0 & 1 \end{bmatrix} \]

can be written as (the row vector) $(4, 0, 1)^T$.

It is common to extend concepts from (column) vectors to row vectors, by
applying the concept to the transposed row vectors. We say that a collection of
row vectors is linearly dependent (or independent) if their transposes (which are
column vectors) are linearly dependent (or independent). For example, ‘the rows
of a matrix A are linearly independent’ means that the columns of $A^T$ are linearly
independent. As another example, ‘the rows of a matrix A are orthonormal’ means
that their transposes, the columns of $A^T$, are orthonormal. ‘Clustering the rows of
a matrix X’ means clustering the columns of $X^T$.


Transpose of block matrix. The transpose of a block matrix has the simple form
(shown here for a $2 \times 2$ block matrix)

\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix}^T = \begin{bmatrix} A^T & C^T \\ B^T & D^T \end{bmatrix}, \]

where A, B, C, and D are matrices with compatible sizes. The transpose of a block
matrix is the transposed block matrix, with each element transposed.


Document-term matrix. Consider a corpus (collection) of N documents, with
word count vectors for a dictionary with n words. The document-term matrix
associated with the corpus is the $N \times n$ matrix A, with $A_{ij}$ the number of times word
j appears in document i. The rows of the document-term matrix are $a_1^T, \ldots, a_N^T$,
where the n-vectors $a_1, \ldots, a_N$ are the word count vectors for documents 1,..., N,
respectively. The columns of the document-term matrix are also interesting. The
jth column of A, which is an N-vector, gives the number of times word j appears
in each of the N documents of the corpus.


Data matrix. A collection of N n-vectors, for example feature n-vectors associated 
with N objects, can be given as an n x N matrix whose N columns are the vectors, 
as described on page 111. It is also common to describe this collection of vectors 
using the transpose of that matrix. In this case, we give the vectors as an N x n 
matrix X. Its ith row is $x_i^T$, the transpose of the ith vector. Its jth column gives
the value of the jth entry (or feature) across the collection of N vectors. When 
an author refers to a data matrix or feature matrix, it can usually be determined 
from context (for example, its dimensions) whether they mean the data matrix 
organized by rows or columns. 


Symmetric matrix. A square matrix A is symmetric if $A = A^T$, i.e., $A_{ij} = A_{ji}$
for all i, j. Symmetric matrices arise in several applications. For example, suppose
that A is the adjacency matrix of a graph or relation (see page 112). The matrix A
is symmetric when the relation is symmetric, i.e., whenever $(i, j) \in R$, we also have
$(j, i) \in R$. An example is the friend relation on a set of n people, where $(i, j) \in R$
means that person i and person j are friends. (In this case the associated graph is
called the ‘social network graph’.)


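6.3.2 Matrix addition


Two matrices of the same size can be added together. The result is another matrix
of the same size, obtained by adding the corresponding elements of the two matrices.
For example,

\[ \begin{bmatrix} 0 & 4 \\ 7 & 0 \\ 3 & 1 \end{bmatrix} + \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 0 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 6 \\ 9 & 3 \\ 3 & 5 \end{bmatrix}. \]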


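Matrix subtraction is similar. As an example,

\[ \begin{bmatrix} 1 & 6 \\ 9 & 3 \end{bmatrix} - I = \begin{bmatrix} 0 & 6 \\ 9 & 2 \end{bmatrix}. \]
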
(This gives another example where we have to figure out the size of the identity 
matrix. Since we can only add or subtract matrices of the same size, I refers to a 
2 x 2 identity matrix.) 
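
Elementwise addition and subtraction can be sketched as follows. The code is illustrative: the function names are hypothetical, matrices are lists of rows, and a size check enforces the rule that only matrices of the same size can be added or subtracted. The 2 x 2 matrix used for the subtraction is an arbitrary choice.

```python
def same_size(A, B):
    """True if A and B have the same numbers of rows and columns."""
    return len(A) == len(B) and len(A[0]) == len(B[0])


def add(A, B):
    """Elementwise sum of two matrices of the same size."""
    if not same_size(A, B):
        raise ValueError("matrices must have the same size")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def subtract(A, B):
    """Elementwise difference of two matrices of the same size."""
    if not same_size(A, B):
        raise ValueError("matrices must have the same size")
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[0, 4], [7, 0], [3, 1]]
B = [[1, 2], [2, 3], [0, 4]]
print(add(A, B))       # [[1, 6], [9, 3], [3, 5]]

C = [[1, 6], [9, 3]]
I2 = [[1, 0], [0, 1]]  # the identity must be 2 x 2 to match C
print(subtract(C, I2))  # [[0, 6], [9, 2]]
```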


Properties of matrix addition. The following important properties of matrix addition
can be verified directly from the definition. We assume here that A, B, and
C are matrices of the same size.


• Commutativity. $A + B = B + A$.


• Associativity. $(A + B) + C = A + (B + C)$. We therefore write both as
$A + B + C$.


• Addition with zero matrix. $A + 0 = 0 + A = A$. Adding the zero matrix to a
matrix has no effect.


• Transpose of sum. $(A + B)^T = A^T + B^T$. The transpose of a sum of two
matrices is the sum of their transposes.


6.3.3 Scalar-matrix multiplication


Scalar multiplication of matrices is defined in a similar way as for vectors, and is
done by multiplying every element of the matrix by the scalar. For example

\[ (-2) \begin{bmatrix} 1 & 6 \\ 9 & 3 \\ 6 & 0 \end{bmatrix} = \begin{bmatrix} -2 & -12 \\ -18 & -6 \\ -12 & 0 \end{bmatrix}. \]

As with scalar-vector multiplication, the scalar can also appear on the right. Note
that $0\,A = 0$ (where the left-hand zero is the scalar zero, and the right-hand 0 is
the zero matrix).

Several useful properties of scalar multiplication follow directly from the definition.
For example, $(\beta A)^T = \beta (A^T)$ for a scalar $\beta$ and a matrix A. If A is a matrix
and $\beta$, $\gamma$ are scalars, then

\[ (\beta + \gamma) A = \beta A + \gamma A, \qquad (\beta \gamma) A = \beta (\gamma A). \]

It is useful to identify the symbols appearing in these two equations. The + symbol
on the left of the left-hand equation is addition of scalars, while the + symbol on
the right of the left-hand equation denotes matrix addition. On the left side of
the right-hand equation we see scalar-scalar multiplication $(\beta\gamma)$ and scalar-matrix
multiplication; on the right we see two cases of scalar-matrix multiplication.

Finally, we mention that scalar-matrix multiplication has higher precedence
than matrix addition, which means that we should carry out multiplication before
addition (when there are no parentheses to fix the order). So the right-hand side
of the left equation above is to be interpreted as $(\beta A) + (\gamma A)$.
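
The properties above can be spot-checked numerically. The sketch below uses integer scalars so that the comparisons are exact; the helper names `scale`, `add`, and `transpose` are hypothetical, and a numerical check on one example is of course not a proof.

```python
def scale(beta, A):
    """Multiply every entry of the matrix A by the scalar beta."""
    return [[beta * a for a in row] for row in A]


def add(A, B):
    """Elementwise sum of two matrices of the same size."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transpose(A):
    return [list(col) for col in zip(*A)]


A = [[1, 6], [9, 3], [6, 0]]
print(scale(-2, A))  # [[-2, -12], [-18, -6], [-12, 0]]

beta, gamma = 2, -5
# (beta + gamma) A = beta A + gamma A
print(scale(beta + gamma, A) == add(scale(beta, A), scale(gamma, A)))  # True
# (beta gamma) A = beta (gamma A)
print(scale(beta * gamma, A) == scale(beta, scale(gamma, A)))          # True
# (beta A)^T = beta (A^T)
print(transpose(scale(beta, A)) == scale(beta, transpose(A)))          # True
```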


6.3.4 Matrix norm


The norm of an $m \times n$ matrix A, denoted $\|A\|$, is the square root of the sum of the
squares of its entries,

\[ \|A\| = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2}. \tag{6.3} \]

This agrees with our definition for vectors when A is a vector, i.e., n = 1. The
norm of an $m \times n$ matrix is the norm of an mn-vector formed from the entries of
the matrix (in any order). Like the vector norm, the matrix norm is a quantitative
measure of the magnitude of a matrix. In some applications it is more natural
to use the RMS values of the matrix entries, $\|A\|/\sqrt{mn}$, as a measure of matrix
size. The RMS value of the matrix entries tells us the typical size of the entries,
independent of the matrix dimensions.

The matrix norm (6.3) satisfies the properties of any norm, given on page 46.
For any $m \times n$ matrix A, we have $\|A\| \ge 0$ (i.e., the norm is nonnegative), and $\|A\| = 0$
only if A = 0 (definiteness). The matrix norm is nonnegative homogeneous: For
any scalar $\gamma$ and $m \times n$ matrix A, we have $\|\gamma A\| = |\gamma| \|A\|$. Finally, for any two
$m \times n$ matrices A and B, we have the triangle inequality,

\[ \|A + B\| \le \|A\| + \|B\|. \]

(The plus symbol on the left-hand side is matrix addition, and the plus symbol on
the right-hand side is addition of numbers.)

The matrix norm allows us to define the distance between two matrices as
$\|A - B\|$. As with vectors, we can say that one matrix is close to, or near, another
one if their distance is small. (What qualifies as small depends on the application.)

In this book we will only use the matrix norm (6.3). Several other norms of
a matrix are commonly used, but are beyond the scope of this book. In contexts
where other norms of a matrix are used, the norm (6.3) is called the Frobenius
norm, after the mathematician Ferdinand Georg Frobenius, and is usually denoted
with a subscript, as $\|A\|_F$.

One simple property of the matrix norm is $\|A\| = \|A^T\|$, i.e., the norm of a
matrix is the same as the norm of its transpose. Another one is

\[ \|A\|^2 = \|a_1\|^2 + \cdots + \|a_n\|^2, \]

where $a_1, \ldots, a_n$ are the columns of A. In other words: The squared norm of a
matrix is the sum of the squared norms of its columns.
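
The norm, the RMS value, and the two properties just stated can be checked on a small example. This is a sketch with hypothetical helper names; the matrices are chosen so that the sum of squares of A is 25, and floating-point comparisons use a tolerance.

```python
import math


def norm(A):
    """Square root of the sum of the squares of the entries of A."""
    return math.sqrt(sum(a * a for row in A for a in row))


def rms(A):
    """RMS value of the entries: the norm divided by sqrt(m n)."""
    return norm(A) / math.sqrt(len(A) * len(A[0]))


def transpose(A):
    return [list(col) for col in zip(*A)]


A = [[2, -1, 0],
     [0, 2, 4]]
B = [[1, 1, 1],
     [0, -3, 2]]

print(norm(A))                                  # 5.0
print(math.isclose(norm(transpose(A)), norm(A)))  # True

# Squared norm equals the sum of squared column norms (columns are rows of A^T).
col_sq = [sum(a * a for a in col) for col in transpose(A)]
print(col_sq, sum(col_sq))                      # [4, 5, 16] 25

# Triangle inequality and distance between A and B.
S = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
Dm = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
print(norm(S) <= norm(A) + norm(B))             # True
print(norm(Dm))                                 # distance between A and B
```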


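6.4 Matrix-vector multiplication


If A is an $m \times n$ matrix and x is an n-vector, then the matrix-vector product $y = Ax$
is the m-vector y with elements

\[ y_i = \sum_{k=1}^n A_{ik} x_k = A_{i1} x_1 + \cdots + A_{in} x_n, \quad i = 1, \ldots, m. \tag{6.4} \]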


As a simple example, we have

\[ \begin{bmatrix} 0 & 2 & -1 \\ -2 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} (0)(2) + (2)(1) + (-1)(-1) \\ (-2)(2) + (1)(1) + (1)(-1) \end{bmatrix} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}. \]


Row and column interpretations. We can express the matrix-vector product in
terms of the rows or columns of the matrix. From (6.4) we see that $y_i$ is the inner
product of x with the ith row of A:

\[ y_i = b_i^T x, \quad i = 1, \ldots, m, \]

where $b_i^T$ is the row i of A. The matrix-vector product can also be interpreted in
terms of the columns of A. If $a_k$ is the kth column of A, then $y = Ax$ can be
written

\[ y = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n. \]

This shows that $y = Ax$ is a linear combination of the columns of A; the coefficients
in the linear combination are the elements of x.
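
The two interpretations suggest two ways to organize the computation, sketched below with hypothetical function names. The first forms each entry of y as an inner product with a row of A; the second accumulates a linear combination of the columns of A. Both give the same vector.

```python
def matvec_rows(A, x):
    """y_i is the inner product of row i of A with x."""
    return [sum(a * xk for a, xk in zip(row, x)) for row in A]


def matvec_columns(A, x):
    """y is the sum over k of x_k times column k of A."""
    m = len(A)
    y = [0] * m
    for k, xk in enumerate(x):
        for i in range(m):
            y[i] += xk * A[i][k]  # add x_k times entry i of column k
    return y


A = [[0, 2, -1],
     [-2, 1, 1]]
x = [2, 1, -1]

print(matvec_rows(A, x))     # [3, -4]
print(matvec_columns(A, x))  # [3, -4]
```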


General examples. In the examples below, A is an m x n matrix and x is an 
n-vector. 


• Zero matrix. When A = 0, we have Ax = 0. In other words, 0x = 0. (The
left-hand 0 is an $m \times n$ matrix, and the right-hand zero is an m-vector.)


• Identity. We have Ix = x for any vector x. (The identity matrix here has
dimension $n \times n$.) In other words, multiplying a vector by the identity matrix
gives the same vector.


• Picking out columns and rows. An important identity is $A e_j = a_j$, the jth
column of A. Multiplying a unit vector by a matrix ‘picks out’ one of the
columns of the matrix. $A^T e_i$, which is an n-vector, is the ith row of A,
transposed. (In other words, $(A^T e_i)^T$ is the ith row of A.)


• Summing or averaging columns or rows. The m-vector $A\mathbf{1}$ is the sum of the
columns of A; its ith entry is the sum of the entries in the ith row of A.
The m-vector $A(\mathbf{1}/n)$ is the average of the columns of A; its ith entry is
the average of the entries in the ith row of A. In a similar way, $A^T \mathbf{1}$ is an
n-vector, whose jth entry is the sum of the entries in the jth column of A.


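• Difference matrix. The $(n - 1) \times n$ matrix

\[ D = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 & 0 \\ & & \ddots & \ddots & & & \\ & & & \ddots & \ddots & & \\ 0 & 0 & 0 & \cdots & -1 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & -1 & 1 \end{bmatrix} \tag{6.5} \]

(where entries not shown are zero, and entries with diagonal dots are 1 or
$-1$, continuing the pattern) is called the difference matrix. The vector Dx is
the (n - 1)-vector of differences of consecutive entries of x:

\[ Dx = \begin{bmatrix} x_2 - x_1 \\ x_3 - x_2 \\ \vdots \\ x_n - x_{n-1} \end{bmatrix}. \]


• Running sum matrix. The $n \times n$ matrix

\[ S = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & & \vdots \\ 1 & 1 & 1 & \cdots & 1 & 0 \\ 1 & 1 & 1 & \cdots & 1 & 1 \end{bmatrix} \tag{6.6} \]

is called the running sum matrix. The ith entry of the n-vector Sx is the
sum of the first i entries of x:

\[ Sx = \begin{bmatrix} x_1 \\ x_1 + x_2 \\ x_1 + x_2 + x_3 \\ \vdots \\ x_1 + \cdots + x_n \end{bmatrix}. \]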
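
Several of the general examples above can be verified with a few lines of Python. The sketch below builds the difference and running sum matrices for n = 4 and also checks the unit-vector and ones-vector identities on a small matrix; all function names and the test data are illustrative assumptions.

```python
def matvec(A, x):
    """Matrix-vector product, one inner product per row of A."""
    return [sum(a * xk for a, xk in zip(row, x)) for row in A]


def difference_matrix(n):
    """(n-1) x n matrix with -1 on the diagonal and +1 just to its right."""
    return [[-1 if j == i else 1 if j == i + 1 else 0 for j in range(n)]
            for i in range(n - 1)]


def running_sum_matrix(n):
    """n x n lower triangular matrix of ones."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


x = [3, 5, 4, 10]
print(matvec(difference_matrix(4), x))   # [2, -1, 6]
print(matvec(running_sum_matrix(4), x))  # [3, 8, 12, 22]

A = [[1, 2, 3],
     [4, 5, 6]]
e2 = [0, 1, 0]     # second unit vector of size 3
ones = [1, 1, 1]   # ones vector of size 3
print(matvec(A, e2))    # [2, 5], the second column of A
print(matvec(A, ones))  # [6, 15], the row sums of A
```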


Application examples. 


• Feature matrix and weight vector. Suppose X is a feature matrix, where its
N columns $x_1, \ldots, x_N$ are feature n-vectors for N objects or examples. Let
the n-vector w be a weight vector, and let $s_i = x_i^T w$ be the score associated
with object i using the weight vector w. Then we can write $s = X^T w$, where
s is the N-vector of scores of the objects.


• Portfolio return time series. Suppose that R is a $T \times n$ asset return matrix,
that gives the returns of n assets over T periods. A common trading strategy 
maintains constant investment weights given by the n-vector w over the T 
periods. For example, w4 = 0.15 means that 15% of the total portfolio value 
is held in asset 4. (Short positions are denoted by negative entries in w.) 
Then Rw, which is a T-vector, is the time series of the portfolio returns over 
the periods 1,...,T. 

As an example, consider a portfolio of the 4 assets in table 6.1, with weights 
w = (0.4,0.3,—0.2,0.5). The product Rw = (0.00213, —0.00201, 0.00241) 
gives the portfolio returns over the three periods in the example. 


• Polynomial evaluation at multiple points. Suppose the entries of the n-vector
c are the coefficients of a polynomial p of degree n - 1 or less:

\[ p(t) = c_1 + c_2 t + \cdots + c_{n-1} t^{n-2} + c_n t^{n-1}. \]

Let $t_1, \ldots, t_m$ be m numbers, and define the m-vector y as $y_i = p(t_i)$. Then
we have y = Ac, where A is the $m \times n$ matrix

\[ A = \begin{bmatrix} 1 & t_1 & \cdots & t_1^{n-2} & t_1^{n-1} \\ 1 & t_2 & \cdots & t_2^{n-2} & t_2^{n-1} \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & t_m & \cdots & t_m^{n-2} & t_m^{n-1} \end{bmatrix}. \tag{6.7} \]

So multiplying a vector c by the matrix A is the same as evaluating a polynomial
with coefficients c at m points. The matrix A in (6.7) comes up often,
and is called a Vandermonde matrix (of degree n - 1, at the points $t_1, \ldots, t_m$),
named for the mathematician Alexandre-Théophile Vandermonde.


• Total price from multiple suppliers. Suppose the $m \times n$ matrix P gives the
prices of n goods from m suppliers (or in m different locations). If q is an
n-vector of quantities of the n goods (sometimes called a basket of goods),
then c = Pq is an m-vector that gives the total cost of the goods, from each
of the m suppliers.


• Document scoring. Suppose A is an $N \times n$ document-term matrix, which gives
the word counts of a corpus of N documents using a dictionary of n words,
so the rows of A are the word count vectors for the documents. Suppose that
w is an n-vector that gives a set of weights for the words in the dictionary.
Then s = Aw is an N-vector that gives the scores of the documents, using
the weights and the word counts. A search engine, for example, might choose
w (based on the search query) so that the scores are predictions of relevance
of the documents (to the search).


• Audio mixing. Suppose the k columns of A are vectors representing audio
signals or tracks of length T, and w is a k-vector. Then b = Aw is a T-vector
representing the mix of the audio signals, with track weights given by the
vector w.
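
Two of the applications above are easy to reproduce numerically. The sketch below computes the portfolio return time series Rw for the returns in table 6.1 with weights w = (0.4, 0.3, -0.2, 0.5), and evaluates a polynomial at several points with a Vandermonde matrix. The helper names, the polynomial coefficients, and the evaluation points are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Matrix-vector product, one inner product per row of A."""
    return [sum(a * xk for a, xk in zip(row, x)) for row in A]


# Asset return matrix from table 6.1 (rows: March 1, 2, 3; columns: AAPL, GOOG, MMM, AMZN).
R = [
    [0.00219, 0.00006, -0.00113, 0.00202],
    [0.00744, -0.00894, -0.00019, -0.00468],
    [0.01488, -0.00215, 0.00433, -0.00407],
]
w = [0.4, 0.3, -0.2, 0.5]
r = matvec(R, w)
print([round(v, 5) for v in r])  # [0.00213, -0.00201, 0.00241]


def vandermonde(ts, n):
    """m x n matrix whose row i is (1, t_i, t_i^2, ..., t_i^(n-1))."""
    return [[t ** k for k in range(n)] for t in ts]


c = [1, -2, 3]      # p(t) = 1 - 2t + 3t^2
ts = [0, 1, 2, -1]  # evaluation points
y = matvec(vandermonde(ts, len(c)), c)
print(y)            # [1, 2, 9, 6]
```

The rounded portfolio returns agree with the values quoted in the portfolio example.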


Inner product. When a and b are n-vectors, $a^T b$ is exactly the inner product of a
and b, obtained from the rules for transposing matrices and forming a matrix-vector
product. We start with the (column) n-vector a, consider it as an $n \times 1$ matrix,
and transpose it to obtain the n-row-vector $a^T$. Now we multiply this $1 \times n$ matrix
by the n-vector b, to obtain the 1-vector $a^T b$, which we also consider a scalar.
So the notation $a^T b$ for the inner product is just a special case of matrix-vector
multiplication.


Linear dependence of columns. We can express the concepts of linear depen- 
dence and independence in a compact form using matrix-vector multiplication. 
The columns of a matrix A are linearly dependent if Ax = 0 for some $x \neq 0$. The
columns of a matrix A are linearly independent if Ax = 0 implies x = 0. 


Expansion in a basis. If the columns of A are a basis, which means A is square
with linearly independent columns $a_1, \ldots, a_n$, then for any n-vector b there is a
unique n-vector x that satisfies Ax = b. In this case the vector x gives the coefficients
in the expansion of b in the basis $a_1, \ldots, a_n$.


Properties of matrix-vector multiplication. Matrix-vector multiplication satisfies 
several properties that are readily verified. First, it distributes across the vector 
argument: For any m x n matrix A and any n-vectors u and v, we have 


A(u + v) = Au + Av.

Matrix-vector multiplication, like ordinary multiplication of numbers, has higher
precedence than addition, which means that when there are no parentheses to force 
the order of evaluation, multiplications are to be carried out before additions. This 
means that the right-hand side above is to be interpreted as (Au) + (Av). The 
equation above looks innocent and natural, but must be read carefully. On the 
left-hand side, we first add the vectors u and v, which is the addition of n-vectors. 
We then multiply the resulting n-vector by the matrix A. On the right-hand 
side, we first multiply each of the two n-vectors by the matrix A (this is two matrix-vector
multiplies); and then add the two resulting m-vectors together. The left- and right- 
hand sides of the equation above involve very different steps and operations, but 
the final result of each is the same m-vector. 

Matrix-vector multiplication also distributes across the matrix argument: For 
any m x n matrices A and B, and any n-vector u, we have 


(A+ B)u= Au + Bu. 


On the left-hand side the plus symbol is matrix addition; on the right-hand side it 
is vector addition. 

Another basic property is, for any $m \times n$ matrix A, any n-vector u, and any
scalar $\alpha$, we have


\[ (\alpha A)u = \alpha(Au) \]


(and so we can write this as $\alpha Au$). On the left-hand side, we have scalar-matrix
multiplication, followed by matrix-vector multiplication; on the right-hand side, we
start with matrix-vector multiplication, and then perform scalar-vector multiplication.
(Note that we also have $\alpha Au = A(\alpha u)$.)
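
The properties above can be spot-checked numerically. The following is only a sketch with made-up integer data and hypothetical helper names; integer arithmetic is used so that the equalities hold exactly (with floating point numbers they would hold only up to rounding).

```python
def matvec(A, x):
    """Return the matrix-vector product A x (A is a list of rows)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def vec_add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def mat_add(A, B):
    return [vec_add(ra, rb) for ra, rb in zip(A, B)]

def scale_vec(alpha, u):
    return [alpha * ui for ui in u]

def scale_mat(alpha, A):
    return [scale_vec(alpha, row) for row in A]

A = [[1, 2, 0], [0, -1, 3]]
B = [[2, 0, 1], [1, 1, 1]]
u = [1, 0, 2]
v = [3, -1, 1]
alpha = 3

checks = {
    "A(u+v) = Au + Av":
        matvec(A, vec_add(u, v)) == vec_add(matvec(A, u), matvec(A, v)),
    "(A+B)u = Au + Bu":
        matvec(mat_add(A, B), u) == vec_add(matvec(A, u), matvec(B, u)),
    "(alpha A)u = alpha(Au)":
        matvec(scale_mat(alpha, A), u) == scale_vec(alpha, matvec(A, u)),
    "alpha(Au) = A(alpha u)":
        scale_vec(alpha, matvec(A, u)) == matvec(A, scale_vec(alpha, u)),
}
```

A check on one data set is of course not a proof; it only illustrates which operations appear on each side of the identities.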


Input-output interpretation. We can interpret the relation y = Ax, with A an 
m x n matrix, as a mapping from the n-vector x to the m-vector y. In this 
context we might think of x as an input, and y as the corresponding output. From 
equation (6.4), we can interpret $A_{ij}$ as the factor by which $y_i$ depends on $x_j$. Some
examples of conclusions we can draw are given below. 


• If $A_{23}$ is positive and large, then $y_2$ depends strongly on $x_3$, and increases as
$x_3$ increases.


• If $A_{32}$ is much larger than the other entries in the third row of A, then $y_3$
depends much more on $x_2$ than the other inputs.


• If A is square and lower triangular, then $y_i$ only depends on $x_1, \ldots, x_i$.


Complexity 


Computer representation of matrices. An $m \times n$ matrix is usually represented
on a computer as an $m \times n$ array of floating point numbers, which requires 8mn
bytes. In some software systems symmetric matrices are represented in a more
efficient way, by only storing the upper triangular elements in the matrix, in some
specific order. This reduces the memory requirement by around a factor of two.
Sparse matrices are represented by various methods that encode for each nonzero
element its row index i (an integer), its column index j (an integer) and its value
$A_{ij}$ (a floating point number). When the row and column indices are represented
using 4 bytes (which allows m and n to range up to around 4.3 billion) this requires
a total of around 16 nnz(A) bytes.
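
To make the storage figures concrete, here is a small sketch that applies the byte counts quoted above (8 bytes per floating point number, 4 bytes per index). The function names and the example dimensions are hypothetical, and real sparse formats differ in their details.

```python
def dense_bytes(m, n, bytes_per_float=8):
    """Storage for an m x n array of floating point numbers."""
    return bytes_per_float * m * n

def sparse_bytes(nnz, bytes_per_float=8, bytes_per_index=4):
    """Storage for nnz (row index, column index, value) triples."""
    return nnz * (bytes_per_float + 2 * bytes_per_index)

m = n = 10_000      # hypothetical dimensions
nnz = 50_000        # hypothetical number of nonzero entries

dense = dense_bytes(m, n)     # 800,000,000 bytes
sparse = sparse_bytes(nnz)    # 800,000 bytes
ratio = dense // sparse       # sparse storage is 1000 times smaller here
```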


Complexity of matrix addition, scalar multiplication, and transposition. The
addition of two $m \times n$ matrices or a scalar multiplication of an $m \times n$ matrix
each take mn flops. When A is sparse, scalar multiplication requires nnz(A)
flops. When at least one of A and B is sparse, computing A + B requires at
most min{nnz(A), nnz(B)} flops. (For any entry i, j for which one of $A_{ij}$ or $B_{ij}$ is
zero, no arithmetic operations are needed to find $(A + B)_{ij}$.) Matrix transposition,
i.e., computing $A^T$, requires zero flops, since we simply copy entries of A to those
of $A^T$. (Copying the entries does take time to carry out, but this is not reflected
in the flop count.)


Complexity of matrix-vector multiplication. A matrix-vector multiplication of
an $m \times n$ matrix A with an n-vector x requires $m(2n - 1)$ flops, which we simplify
to 2mn flops. This can be seen as follows. The result y = Ax of the product is an
m-vector, so there are m numbers to compute. The ith element of y is the inner
product of the ith row of A and the vector x, which takes $2n - 1$ flops.
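
The flop count can be checked by instrumenting a row-by-row product. This is an illustrative sketch (the function name and data are made up); it counts one flop per multiply and one per addition, as in the argument above.

```python
def matvec_with_flops(A, x):
    """Compute y = A x row by row, counting multiplies and additions."""
    y, flops = [], 0
    for row in A:
        total = row[0] * x[0]           # first product: 1 multiply
        flops += 1
        for a_ij, x_j in zip(row[1:], x[1:]):
            total += a_ij * x_j         # 1 multiply and 1 addition
            flops += 2
        y.append(total)
    return y, flops

A = [[1, 2, 3],
     [4, 5, 6]]       # m = 2, n = 3
x = [1, 0, -1]

y, flops = matvec_with_flops(A, x)
m, n = len(A), len(x)
```

For this small example the counter gives 10 flops, which equals m(2n − 1) with m = 2 and n = 3.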

If A is sparse, computing Ax requires nnz(A) multiplies (of $A_{ij}$ and $x_j$, for
each nonzero entry of A) and a number of additions that is no more than nnz(A).
Thus, the complexity is between nnz(A) and 2nnz(A) flops. As a special example,
suppose A is $n \times n$ and diagonal. Then Ax can be computed with n multiplies ($A_{ii}$
times $x_i$) and no additions, a total of n = nnz(A) flops.


Exercises


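6.1 Matrix and vector notation. Suppose $a_1, \ldots, a_n$ are m-vectors. Determine whether each
expression below makes sense (i.e., uses valid notation). If the expression does make
sense, give its dimensions.

(a) $\begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}$

(b) $\begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix}$

(c) $\begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}$

(d) $\begin{bmatrix} a_1^T & \cdots & a_n^T \end{bmatrix}$

6.2 Matrix notation. Suppose the block matrix

\[ \begin{bmatrix} A & I \\ I & C \end{bmatrix} \]

makes sense, where A is a $p \times q$ matrix. What are the dimensions of C?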


6.3 Block matrix. Assuming the matrix

\[ K = \begin{bmatrix} I & A^T \\ A & 0 \end{bmatrix} \]

makes sense, which of the following statements must be true? ('Must be true' means that
it follows with no additional assumptions.)


(a) K is square. 

(b) A is square or wide. 

(c) K is symmetric, i.e., KT = K. 

(d) The identity and zero submatrices in K have the same dimensions. 


(e) The zero submatrix is square. 


6.4 Adjacency matrix row and column sums. Suppose A is the adjacency matrix of a directed 
graph (see page 112). What are the entries of the vector $A\mathbf{1}$? What are the entries of the
vector $A^T\mathbf{1}$?

6.5 Adjacency matrix of reversed graph. Suppose A is the adjacency matrix of a directed 
graph (see page 112). The reversed graph is obtained by reversing the directions of all 
the edges of the original graph. What is the adjacency matrix of the reversed graph? 
(Express your answer in terms of A.) 


6.6 Matrix-vector multiplication. For each of the following matrices, describe in words how x
and y = Ax are related. In each case x and y are n-vectors, with n = 3k.


(a) $A = \begin{bmatrix} 0 & 0 & I_k \\ 0 & I_k & 0 \\ I_k & 0 & 0 \end{bmatrix}$.

(b) $A = \begin{bmatrix} E & 0 & 0 \\ 0 & E & 0 \\ 0 & 0 & E \end{bmatrix}$, where E is the $k \times k$ matrix with all entries 1/k.

6.7 Currency exchange matrix. We consider a set of n currencies, labeled 1, ..., n. (These
might correspond to USD, RMB, EUR, and so on.) At a particular time the exchange or
conversion rates among the n currencies are given by an $n \times n$ (exchange rate) matrix R,
where $R_{ij}$ is the amount of currency i that you can buy for one unit of currency j. (All
entries of R are positive.) The exchange rates include commission charges, so we have
$R_{ji}R_{ij} < 1$ for all $i \neq j$. You can assume that $R_{ii} = 1$.


Suppose y = Rx, where x is a vector (with nonnegative entries) that represents the
amounts of the currencies that we hold. What is $y_i$? Your answer should be in English.


6.8 Cash flow to bank account balance. The T-vector c represents the cash flow for an interest
bearing bank account over T time periods. Positive values of c indicate a deposit, and
negative values indicate a withdrawal. The T-vector b denotes the bank account balance
in the T periods. We have $b_1 = c_1$ (the initial deposit or withdrawal) and


\[ b_t = (1 + r) b_{t-1} + c_t, \quad t = 2, \ldots, T, \]


where r > 0 is the (per-period) interest rate. (The first term is the previous balance plus
the interest, and the second term is the deposit or withdrawal.)


Find the T x T matrix A for which b = Ac. That is, the matrix A maps a cash flow 
sequence into a bank account balance sequence. Your description must make clear what 
all entries of A are. 


6.9 Multiple channel marketing campaign. Potential customers are divided into m market
segments, which are groups of customers with similar demographics, e.g., college educated
women aged 25-29. A company markets its products by purchasing advertising in a set of
n channels, i.e., specific TV or radio shows, magazines, web sites, blogs, direct mail, and
so on. The ability of each channel to deliver impressions or views by potential customers
is characterized by the reach matrix, the $m \times n$ matrix R, where $R_{ij}$ is the number of
views of customers in segment i for each dollar spent on channel j. (We assume that the
total number of views in each market segment is the sum of the views from each channel,
and that the views from each channel scale linearly with spending.) The n-vector c will
denote the company's purchases of advertising, in dollars, in the n channels. The m-vector
v gives the total number of impressions in the m market segments due to the advertising
in all channels. Finally, we introduce the m-vector a, where $a_i$ gives the profit in dollars
per impression in market segment i. The entries of R, c, v, and a are all nonnegative.


(a) Express the total amount of money the company spends on advertising using vec- 
tor/matrix notation. 

(b) Express v using vector/matrix notation, in terms of the other vectors and matrices. 
(c) Express the total profit from all market segments using vector/matrix notation. 


(d) How would you find the single channel most effective at reaching market segment 3, 
in terms of impressions per dollar spent? 


(e) What does it mean if R35 is very small (compared to other entries of R)? 


6.10 Resource requirements. We consider an application with n different job (types), each of
which consumes m different resources. We define the $m \times n$ resource matrix R, with
entry $R_{ij}$ giving the amount of resource i that is needed to run one unit of job j, for
i = 1, ..., m and j = 1, ..., n. (These numbers are typically positive.) The number (or
amount) of each of the different jobs to be processed or run is given by the entries of the 
n-vector x. (These entries are typically nonnegative integers, but they can be fractional 
if the jobs are divisible.) The entries of the m-vector p give the price per unit of each of 
the resources. 


(a) Let y be the m-vector whose entries give the total of each of the m resources needed 
to process the jobs given by x. Express y in terms of R and x using matrix and 
vector notation.

(b) Let c be an n-vector whose entries give the cost per unit for each job type. (This
is the total cost of the resources required to run one unit of the job type.) Express
c in terms of R and p using matrix and vector notation.


Remark. One example is a data center, which runs many instances of each of n types of 
application programs. The resources include number of cores, amount of memory, disk, 
and network bandwidth. 


6.11 Let A and B be two $m \times n$ matrices. Under each of the assumptions below, determine
whether A = B must always hold, or whether A = B holds only sometimes.

(a) Suppose Ax = Bx holds for all n-vectors x.

(b) Suppose Ax = Bx for some nonzero n-vector x.

6.12 Skew-symmetric matrices. An $n \times n$ matrix A is called skew-symmetric if $A^T = -A$, i.e.,
its transpose is its negative. (A symmetric matrix satisfies $A^T = A$.)

(a) Find all 2 x 2 skew-symmetric matrices. 

(b) Explain why the diagonal entries of a skew-symmetric matrix must be zero. 


(c) Show that for a skew-symmetric matrix A, and any n-vector x, $(Ax) \perp x$. This
means that Ax and x are orthogonal. Hint. First show that for any $n \times n$ matrix A
and n-vector x, $x^T(Ax) = \sum_{i,j=1}^n A_{ij} x_i x_j$.

(d) Now suppose A is any matrix for which $(Ax) \perp x$ for any n-vector x. Show that A
must be skew-symmetric. Hint. You might find the formula


\[ (e_i + e_j)^T (A(e_i + e_j)) = A_{ii} + A_{jj} + A_{ij} + A_{ji}, \]

valid for any $n \times n$ matrix A, useful. For i = j, this reduces to $e_i^T (A e_i) = A_{ii}$.


6.13 Polynomial differentiation. Suppose p is a polynomial of degree n − 1 or less, given by
$p(t) = c_1 + c_2 t + \cdots + c_n t^{n-1}$. Its derivative (with respect to t) $p'(t)$ is a polynomial of
degree n − 2 or less, given by $p'(t) = d_1 + d_2 t + \cdots + d_{n-1} t^{n-2}$. Find a matrix D for which
d = Dc. (Give the entries of D, and be sure to specify its dimensions.)


6.14 Norm of matrix-vector product. Suppose A is an $m \times n$ matrix and x is an n-vector. A
famous inequality relates $\|x\|$, $\|A\|$, and $\|Ax\|$:


\[ \|Ax\| \leq \|A\| \|x\|. \]


The left-hand side is the (vector) norm of the matrix-vector product; the right-hand side
is the (scalar) product of the matrix and vector norms. Show this inequality. Hints. Let
$a_i^T$ be the ith row of A. Use the Cauchy-Schwarz inequality to get $(a_i^T x)^2 \leq \|a_i\|^2 \|x\|^2$.
Then add the resulting m inequalities.


6.15 Distance between adjacency matrices. Let A and B be the $n \times n$ adjacency matrices of
two directed graphs with n vertices (see page 112). The squared distance $\|A - B\|^2$ can
be used to express how different the two graphs are. Show that $\|A - B\|^2$ is the total
number of directed edges that are in one of the two graphs but not in the other.


6.16 Columns of difference matrix. Are the columns of the difference matrix D, defined in (6.5),
linearly independent? 


6.17 Stacked matrix. Let A be an $m \times n$ matrix, and consider the stacked matrix S defined by


\[ S = \begin{bmatrix} A \\ I \end{bmatrix}. \]


When does S have linearly independent columns? When does S have linearly independent 
rows? Your answer can depend on m, n, or whether or not A has linearly independent 
columns or rows. 




6.18 Vandermonde matrices. A Vandermonde matrix is an $m \times n$ matrix of the form


\[ V = \begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^{n-1} \\ 1 & t_2 & t_2^2 & \cdots & t_2^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_m & t_m^2 & \cdots & t_m^{n-1} \end{bmatrix}, \]

where $t_1, \ldots, t_m$ are numbers. Multiplying an n-vector c by the Vandermonde matrix V
is the same as evaluating the polynomial of degree less than n, with coefficients $c_1, \ldots, c_n$,
at the points $t_1, \ldots, t_m$; see page 120. Show that the columns of a Vandermonde matrix
are linearly independent if the numbers $t_1, \ldots, t_m$ are distinct, i.e., different from each
other. Hint. Use the following fact from algebra: If a polynomial p with degree less than
n has n or more roots (points t for which p(t) = 0) then all its coefficients are zero.


6.19 Back-test timing. The $T \times n$ asset returns matrix R gives the returns of n assets over
T periods. (See page 120.) When the n-vector w gives a set of portfolio weights, the 
T-vector Rw gives the time series of portfolio return over the T time periods. Evaluating 
portfolio return with past returns data is called back-testing. 

Consider a specific case with n = 5000 assets, and T = 2500 returns. (This is 10 years 
of daily returns, since there are around 250 trading days in each year.) About how long 
would it take to carry out this back-test, on a 1 Gflop/s computer? 


6.20 Complexity of matrix-vector multiplication. On page 123 we worked out the complexity of
computing the m-vector Ax, where A is an $m \times n$ matrix and x is an n-vector, when each
entry of Ax is computed as an inner product of a row of A and the vector x. Suppose
instead that we compute Ax as a linear combination of the columns of A, with coefficients
$x_1, \ldots, x_n$. How many flops does this method require? How does it compare to the method
described on page 123?


6.21 Complexity of matrix-sparse-vector multiplication. On page 123 we consider the complexity
of computing Ax, where A is a sparse $m \times n$ matrix and x is an n-vector (not
assumed to be sparse). Now consider the complexity of computing Ax when the $m \times n$
matrix A is not sparse, but the n-vector x is sparse, with nnz(x) nonzero entries. Give
the total number of flops in terms of m, n, and nnz(x), and simplify it by dropping terms
that are dominated by others when the dimensions are large. Hint. The vector Ax is a
linear combination of nnz(x) columns of A.


6.22 Distribute or not? Suppose you need to compute z = (A + B)(x + y), where A and B are
m x n matrices and x and y are n-vectors. 


(a) What is the approximate flop count if you evaluate z as expressed, i.e., by adding 
A and B, adding x and y, and then carrying out the matrix-vector multiplication? 


(b) What is the approximate flop count if you evaluate z as z = Ax + Ay + Bx + By,
i.e., with four matrix-vector multiplies and three vector additions? 


(c) Which method requires fewer flops? Your answer can depend on m and n. Remark. 
When comparing two computation methods, we usually do not consider a factor of 
2 or 3 in flop counts to be significant, but in this exercise you can.


Matrix examples


In this chapter we describe some special matrices that occur often in applications. 


Geometric transformations 


Suppose the 2-vector (or 3-vector) x represents a position in 2-D (or 3-D) space. 
Several important geometric transformations or mappings from points to points 
can be expressed as matrix-vector products y = Ax, with A a 2 x 2 (or 3 x 3) 
matrix. In the examples below, we consider the mapping from x to y, and focus 
on the 2-D case (for which some of the matrices are simpler to describe). 


Scaling. Scaling is the mapping y = ax, where a is a scalar. This can be expressed
as y = Ax with A = aI. This mapping stretches a vector by the factor |a| (or
shrinks it when |a| < 1), and it flips the vector (reverses its direction) if a < 0.


Dilation. Dilation is the mapping y = Dx, where D is a diagonal matrix,
$D = \mathbf{diag}(d_1, d_2)$. This mapping stretches the vector x by different factors along the
two different axes. (Or shrinks, if $|d_i| < 1$, and flips, if $d_i < 0$.)


Rotation. Suppose that y is the vector obtained by rotating x by $\theta$ radians
counterclockwise. Then we have


\[ y = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} x. \tag{7.1} \]

This matrix is called (for obvious reasons) a rotation matrix.


Reflection. Suppose that y is the vector obtained by reflecting x through the line
that passes through the origin, inclined $\theta$ radians with respect to horizontal. Then
we have

\[ y = \begin{bmatrix} \cos(2\theta) & \sin(2\theta) \\ \sin(2\theta) & -\cos(2\theta) \end{bmatrix} x. \]


Figure 7.1 From left to right: A dilation with A = diag(2, 2/3), a counterclockwise
rotation by $\pi/6$ radians, and a reflection through a line that
makes an angle of $\pi/4$ radians with the horizontal line.


Projection onto a line. The projection of the point x onto a set is the point in
the set that is closest to x. Suppose y is the projection of x onto the line that
passes through the origin, inclined $\theta$ radians with respect to horizontal. Then we
have

\[ y = \begin{bmatrix} (1/2)(1 + \cos(2\theta)) & (1/2)\sin(2\theta) \\ (1/2)\sin(2\theta) & (1/2)(1 - \cos(2\theta)) \end{bmatrix} x. \]


Some of these geometric transformations are illustrated in figure 7.1. 
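
The 2-D rotation, reflection, and projection matrices above can be written down directly. The following sketch uses only the Python standard library; the function names and the test vector are chosen for illustration.

```python
import math

def rotation(theta):
    """Counterclockwise rotation by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def reflection(theta):
    """Reflection through the line through the origin inclined theta radians."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return [[c, s], [s, -c]]

def projection(theta):
    """Projection onto the line through the origin inclined theta radians."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return [[0.5 * (1 + c), 0.5 * s], [0.5 * s, 0.5 * (1 - c)]]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

x = [1.0, 0.0]
rotated = matvec(rotation(math.pi / 2), x)      # approximately (0, 1)
reflected = matvec(reflection(math.pi / 4), x)  # approximately (0, 1)
projected = matvec(projection(math.pi / 4), x)  # approximately (0.5, 0.5)
```

Because the entries come from floating point sines and cosines, the results agree with the exact values only up to rounding error.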


Finding the matrix. When a geometric transformation is represented by matrix-vector
multiplication (as in the examples above), a simple method to find the
matrix is to find its columns. The ith column is the vector obtained by applying
the transformation to $e_i$. As a simple example consider clockwise rotation by 90°
in 2-D. Rotating the vector $e_1 = (1, 0)$ by 90° gives (0, −1); rotating $e_2 = (0, 1)$ by
90° gives (1, 0). So rotation by 90° is given by

\[ y = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} x. \]
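
The column-by-column recipe can be sketched in a few lines of Python. Here `matrix_of` and `rotate_cw_90` are hypothetical helper names, and the transformation is assumed to map n-vectors to n-vectors.

```python
def matrix_of(transform, n):
    """Build the n x n matrix whose ith column is transform(e_i)."""
    columns = []
    for i in range(n):
        e_i = [1 if k == i else 0 for k in range(n)]   # ith unit vector
        columns.append(transform(e_i))
    # columns[j][i] is entry (i, j): turn the list of columns into rows
    return [[columns[j][i] for j in range(n)] for i in range(n)]

def rotate_cw_90(x):
    """Clockwise rotation by 90 degrees in 2-D: (x1, x2) -> (x2, -x1)."""
    return [x[1], -x[0]]

A = matrix_of(rotate_cw_90, 2)   # rows of the rotation matrix
```

The result has columns (0, −1) and (1, 0), the images of the two unit vectors under the rotation.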


Change of coordinates. In many applications multiple coordinate systems are 
used to describe locations or positions in 2-D or 3-D. For example in aerospace 
engineering we can describe a position using earth-fixed coordinates or body-fixed 
coordinates, where the body refers to an aircraft. Earth-fixed coordinates are with 
respect to a specific origin, with the three axes pointing East, North, and straight 
up, respectively. The origin of the body-fixed coordinates is a specific location on 
the aircraft (typically the center of gravity), and the three axes point forward (along 
the aircraft body), left (with respect to the aircraft body), and up (with respect 
to the aircraft body). Suppose the 3-vector $x^{\text{body}}$ describes a location using the
body coordinates, and $x^{\text{earth}}$ describes the same location in earth-fixed coordinates.
These are related by

\[ x^{\text{earth}} = p + Q x^{\text{body}}, \]

where p is the location of the airplane center (in earth-fixed coordinates) and Q is
a $3 \times 3$ matrix. The ith column of Q gives the earth-fixed coordinates for the ith
axis of the airplane. For an airplane in level flight, heading due South, we have


\[ Q = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \]


Selectors 


An $m \times n$ selector matrix A is one in which each row is a unit vector (transposed):


\[ A = \begin{bmatrix} e_{k_1}^T \\ \vdots \\ e_{k_m}^T \end{bmatrix}, \]

where $k_1, \ldots, k_m$ are integers in the range 1, ..., n. When it multiplies a vector, it
simply copies the $k_i$th entry of x into the ith entry of y = Ax:

\[ y = (x_{k_1}, x_{k_2}, \ldots, x_{k_m}). \]

In words, each entry of Ax is a selection of an entry of x.

The identity matrix, and the reverser matrix


\[ A = \begin{bmatrix} e_n^T \\ \vdots \\ e_1^T \end{bmatrix} = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 1 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 1 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \end{bmatrix} \]


are special cases of selector matrices. (The reverser matrix reverses the order of
the entries of a vector: $Ax = (x_n, x_{n-1}, \ldots, x_2, x_1)$.) Another one is the r:s slicing
matrix, which can be described as the block matrix


\[ A = \begin{bmatrix} 0_{m \times (r-1)} & I_{m \times m} & 0_{m \times (n-s)} \end{bmatrix}, \]


where m = s − r + 1. (We show the dimensions of the blocks for clarity.) We have
$Ax = x_{r:s}$, i.e., multiplying by A gives the r:s slice of a vector.


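Down-sampling. Another example is the $(n/2) \times n$ matrix (with n even)


\[ A = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1 & 0
\end{bmatrix}. \]

If y = Ax, we have $y = (x_1, x_3, x_5, \ldots, x_{n-3}, x_{n-1})$. When x is a time series, y is
called the 2× down-sampled version of x. If x is a quantity sampled every hour,
then y is the same quantity, sampled every 2 hours.


Figure 7.2 Directed graph with four vertices and five edges.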


Image cropping. As a more interesting example, suppose that x is an image with
$M \times N$ pixels, with M and N even. (That is, x is an MN-vector, with its entries
giving the pixel values in some specific order.) Let y be the $(M/2) \times (N/2)$ image
that is the upper left corner of the image x, i.e., a cropped version. Then we have
y = Ax, where A is an $(MN/4) \times (MN)$ selector matrix. The ith row of A is $e_{k_i}^T$,
where $k_i$ is the index of the pixel in x that corresponds to the ith pixel in y.
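
The selector matrices described above (reverser, slicing, down-sampling, cropping) differ only in the list of indices $k_1, \ldots, k_m$. The sketch below builds each one from such a list; the helper names are hypothetical, and the row-by-row pixel ordering in the cropping example is an assumption, since the text leaves the ordering unspecified.

```python
def selector(indices, n):
    """Selector matrix with rows e_k^T for each k in indices (1-based)."""
    return [[1 if j == k - 1 else 0 for j in range(n)] for k in indices]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

n = 6
x = [10, 20, 30, 40, 50, 60]

reverser = selector(range(n, 0, -1), n)      # k = n, n-1, ..., 1
slicer = selector(range(2, 5), n)            # the 2:4 slice
downsampler = selector(range(1, n, 2), n)    # k = 1, 3, 5

# Cropping a 4 x 4 image to its upper left 2 x 2 corner, assuming the
# pixels are stored row by row (an assumption made for this sketch).
M, N = 4, 4
image = list(range(1, M * N + 1))            # pixel values 1, ..., 16
crop_indices = [r * N + c + 1 for r in range(M // 2) for c in range(N // 2)]
cropper = selector(crop_indices, M * N)      # an (MN/4) x (MN) matrix
cropped = matvec(cropper, image)
```

Forming the matrix explicitly is only for illustration; in practice one would index the vector directly rather than multiply by a mostly-zero matrix.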


Permutation matrices. An $n \times n$ permutation matrix is one in which each column
is a unit vector, and each row is the transpose of a unit vector. (In other words, A
and $A^T$ are both selector matrices.) Thus, exactly one entry of each row is one, and
exactly one entry of each column is one. This means that y = Ax can be expressed
as $y_i = x_{\pi_i}$, where $\pi$ is a permutation of 1, 2, ..., n, i.e., each integer from 1 to n
appears exactly once in $\pi_1, \ldots, \pi_n$.

As a simple example consider the permutation $\pi = (3, 1, 2)$. The associated
permutation matrix is


\[ A = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}. \]


Multiplying a 3-vector by A re-orders its entries: $Ax = (x_3, x_1, x_2)$.
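
As a small sketch (with a hypothetical helper name), the permutation matrix for a given $\pi$ can be built row by row, and the row and column sums confirm that each contains exactly one entry equal to one.

```python
def permutation_matrix(pi):
    """Row i is e_{pi_i}^T, with pi given as 1-based indices."""
    n = len(pi)
    if sorted(pi) != list(range(1, n + 1)):
        raise ValueError("pi must contain each of 1, ..., n exactly once")
    return [[1 if j == k - 1 else 0 for j in range(n)] for k in pi]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

A = permutation_matrix([3, 1, 2])
x = [10, 20, 30]
y = matvec(A, x)                              # (x3, x1, x2)

row_sums = [sum(row) for row in A]
col_sums = [sum(col) for col in zip(*A)]
```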


Incidence matrix 


Directed graph. A directed graph consists of a set of vertices (or nodes), labeled 
1,...,n, and a set of directed edges (or branches), labeled 1,...,m. Each edge is 
connected from one of the nodes and into another one, in which case we say the 
two nodes are connected or adjacent. Directed graphs are often drawn with the 
vertices as circles or dots, and the edges as arrows, as in figure 7.2. A directed
graph can be described by its $n \times m$ incidence matrix, defined as


\[ A_{ij} = \begin{cases} 1 & \text{edge } j \text{ points to node } i \\ -1 & \text{edge } j \text{ points from node } i \\ 0 & \text{otherwise.} \end{cases} \]


The incidence matrix is evidently sparse, since it has only two nonzero entries
in each column (one with value 1 and the other with value −1). The jth column is
associated with the jth edge; the indices of its two nonzero entries give the nodes
that the edge connects. The ith row of A corresponds to node i; its nonzero entries
tell us which edges connect to the node, and whether they point into or away from
the node. The incidence matrix for the graph shown in figure 7.2 is

\[ A = \begin{bmatrix} -1 & -1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & -1 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}. \]


A directed graph can also be described by its adjacency matrix, described on 
page 112. The adjacency and incidence matrices for a directed graph are closely 
related, but not the same. The adjacency matrix does not explicitly label the 
edges j = 1, ..., m. There are also some small differences in the graphs that can
be represented using incidence and adjacency matrices. For example, self edges 
(that connect from and to the same vertex) cannot be represented in an incidence 
matrix. 
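
An incidence matrix is easy to build from a list of edges. In the sketch below the function name is hypothetical, and the edge list for the graph of figure 7.2 is an assumption: the figure is not reproduced here, so the list was chosen to be consistent with the numerical examples given for that graph in this section.

```python
def incidence_matrix(n, edges):
    """edges[j] = (k, l) means edge j+1 points from node k to node l (1-based)."""
    A = [[0] * len(edges) for _ in range(n)]
    for j, (k, l) in enumerate(edges):
        if k == l:
            raise ValueError("self edges cannot be represented")
        A[k - 1][j] = -1    # edge points from node k
        A[l - 1][j] = 1     # edge points to node l
    return A

# Assumed edge list for the four-node, five-edge example graph.
edges = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 4)]
A = incidence_matrix(4, edges)

column_sums = [sum(col) for col in zip(*A)]   # each column has one +1 and one -1
```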


7.3.1 Networks 


In many applications a graph is used to represent a network, through which some 
commodity or quantity such as electricity, water, heat, or vehicular traffic flows. 
The edges of the graph represent the paths or links over which the quantity can 
move or flow, in either direction. If x is an m-vector representing a flow in the 
network, we interpret $x_j$ as the flow (rate) along the edge j, with a positive value
meaning the flow is in the direction of edge j, and negative meaning the flow is 
in the opposite direction of edge j. In a network, the direction of the edge or link 
does not specify the direction of flow; it only specifies which direction of flow we 
consider to be positive. 


Flow conservation. When x represents a flow in a network, the matrix-vector
product y = Ax can be given a very simple interpretation. The n-vector y = Ax
can be interpreted as the vector of net flows, from the edges, into the nodes: $y_i$ is
equal to the total of the flows that come in to node i, minus the total of the flows
that go out from node i. The quantity $y_i$ is sometimes called the flow surplus at
node i.

If Ax = 0, we say that flow conservation occurs, since at each node, the total in-flow
matches the total out-flow. In this case the flow vector x is called a circulation.
This could be used as a model of traffic flow (in a closed system), with the nodes
representing intersections and the edges representing road segments (one for each
direction).


Figure 7.3 Network with four nodes and five edges, with source flows shown.


For a network described by the directed graph example above, the vector


x = (1, −1, 1, 0, 1)


is a circulation, since Ax = 0. This flow corresponds to a unit clockwise flow on 
the outer edges (1, 3, 5, and 2) and no flow on the diagonal edge (4). (Visualizing 
this explains why such vectors are called circulations.) 


Sources. In many applications it is useful to include additional flows called source
flows or exogenous flows, that enter or leave the network at the nodes, but not along
the edges, as shown in figure 7.3. We denote these flows with an n-vector s. We can
think of $s_i$ as a flow that enters the network at node i from outside, i.e., not from
any edge. When $s_i > 0$ the exogenous flow is called a source, since it is injecting
the quantity into the network at the node. When $s_i < 0$ the exogenous flow is
called a sink, since it is removing the quantity from the network at the node.


Flow conservation with sources. The equation Ax + s = 0 means that the flow 
is conserved at each node, counting the source flow: The total of all incoming flow, 
from the incoming edges and exogenous source, minus the total outgoing flow from 
outgoing edges and exogenous sinks, is zero. 

As an example, flow conservation with sources can be used as an approximate
model of a power grid (ignoring losses), with x being the vector of power flows
along the transmission lines, $s_i > 0$ representing a generator injecting power into
the grid at node i, $s_i < 0$ representing a load that consumes power at node i, and
$s_i = 0$ representing a substation where power is exchanged among transmission
lines, with no generation or load attached.

For the example above, consider the source vector s = (1, 0, −1, 0), which corresponds
to an injection of one unit of flow into node 1, and the removal of one
unit of flow at node 3. In other words, node 1 is a source, node 3 is a sink, and
flow is conserved at nodes 2 and 4. For this source, the flow vector

x = (0.6, 0.3, 0.6, −0.1, −0.3)


satisfies flow conservation, i.e., Ax + s = 0. This flow can be explained in words: 
The unit external flow entering node 1 splits three ways, with 0.6 flowing up, 
0.3 flowing right, and 0.1 flowing diagonally up (on edge 4). The upward flow on 
edge 1 passes through node 2, where flow is conserved, and proceeds right on edge 3 
towards node 3. The rightward flow on edge 2 passes through node 4, where flow 
is conserved, and proceeds up on edge 5 to node 3. The one unit of excess flow 
arriving at node 3 is removed as external flow. 
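
Both conservation statements can be checked numerically. In this sketch the incidence matrix is typed in by hand for the four-node, five-edge example graph; it is an assumption inferred from the numerical examples in this section, since the figure itself is not reproduced. A small tolerance is used because the flows 0.6, 0.3, and 0.1 are not exactly representable in floating point.

```python
# Assumed incidence matrix of the example graph (rows: nodes, columns: edges).
A = [[-1, -1, 0, 1, 0],
     [1, 0, -1, 0, 0],
     [0, 0, 1, -1, -1],
     [0, 1, 0, 0, 1]]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Circulation: flow conservation without sources, A x = 0.
circulation = [1, -1, 1, 0, 1]
net_flow = matvec(A, circulation)

# Flow conservation with sources: A x + s = 0.
s = [1, 0, -1, 0]
x = [0.6, 0.3, 0.6, -0.1, -0.3]
residual = [y_i + s_i for y_i, s_i in zip(matvec(A, x), s)]
conserved = all(abs(r) < 1e-12 for r in residual)
```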


Node potentials. A graph is also useful when we focus on the values of some
quantity at each graph vertex or node. Let v be an n-vector, often interpreted as a
potential, with $v_i$ the potential value at node i. We can give a simple interpretation
to the matrix-vector product $u = A^T v$. The m-vector $u = A^T v$ gives the potential
differences across the edges: $u_j = v_l - v_k$, where edge j goes from node k to node l.


Dirichlet energy. When the m-vector $A^T v$ is small, it means that the potential
differences across the edges are small. Another way to say this is that the potentials
of connected vertices are near each other. A quantitative measure of this is the
function of v given by


\[ \mathcal{D}(v) = \|A^T v\|^2. \]


This function arises in many applications, and is called the Dirichlet energy (or
Laplacian quadratic form) associated with the graph. It can be expressed as


\[ \mathcal{D}(v) = \sum_{\text{edges } (k,l)} (v_l - v_k)^2, \]


which is the sum of the squares of the potential differences of v across all edges in 
the graph. The Dirichlet energy is small when the potential differences across the 
edges of the graph are small, i.e., nodes that are connected by edges have similar 
potential values. 

The Dirichlet energy is used as a measure of the non-smoothness (roughness) of
a set of node potentials on a graph. A set of node potentials with small Dirichlet 
energy can be thought of as smoothly varying across the graph. Conversely, a set 
of potentials with large Dirichlet energy can be thought of as non-smooth or rough. 
The Dirichlet energy will arise as a measure of roughness in several applications 
we will encounter later. 

As a simple example, consider the potential vector v = (1, −1, 2, −1) for the
graph shown in figure 7.2. For this set of potentials, the potential differences
across the edges are relatively large, with $A^T v = (-2, -2, 3, -1, -3)$, and the
associated Dirichlet energy is $\|A^T v\|^2 = 27$. Now consider the potential vector
v = (1, 2, 2, 1). The associated edge potential differences are $A^T v = (1, 0, 0, -1, -1)$,
and the Dirichlet energy has the much smaller value $\|A^T v\|^2 = 3$.


Figure 7.4 Chain graph.


Figure 7.5 Two vectors of length 100, with Dirichlet energy D(a) = 1.14 and 
D(b) = 8.99. 
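
The Dirichlet energy example for the graph of figure 7.2 can be reproduced without forming the incidence matrix, by summing squared potential differences over the edges. The edge list below is an assumption (the figure is not reproduced here); it is chosen to be consistent with the potential differences quoted in the text, and the function names are hypothetical.

```python
# Assumed edge list: edges[j] = (k, l) means edge j+1 goes from node k to node l.
edges = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 4)]

def potential_differences(v, edges):
    """Return u with u_j = v_l - v_k for each edge (k, l); this equals A^T v."""
    return [v[l - 1] - v[k - 1] for k, l in edges]

def dirichlet_energy(v, edges):
    """Sum of squared potential differences across all edges."""
    return sum(u * u for u in potential_differences(v, edges))

rough = [1, -1, 2, -1]
smooth = [1, 2, 2, 1]

u_rough = potential_differences(rough, edges)
u_smooth = potential_differences(smooth, edges)
energy_rough = dirichlet_energy(rough, edges)
energy_smooth = dirichlet_energy(smooth, edges)
```

With these inputs the sketch gives the values 27 and 3 stated in the example.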


Chain graph. The incidence matrix and the Dirichlet energy function have a
particularly simple form for the chain graph shown in figure 7.4, with n vertices
and n − 1 edges. The $n \times (n - 1)$ incidence matrix is the transpose of the difference
matrix D described on page 119, in (6.5). The Dirichlet energy is then


\[ \mathcal{D}(v) = \|Dv\|^2 = (v_2 - v_1)^2 + \cdots + (v_n - v_{n-1})^2, \]


the sum of squares of the differences between consecutive entries of the n-vector v. 
This is used as a measure of the non-smoothness of the vector v, considered as a 
time series. Figure 7.5 shows an example. 
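
For a chain graph the Dirichlet energy needs only the consecutive differences. A minimal sketch, with a hypothetical function name and two made-up series (these are not the vectors of figure 7.5):

```python
def chain_dirichlet_energy(v):
    """Sum of squared differences between consecutive entries of v."""
    return sum((b - a) ** 2 for a, b in zip(v, v[1:]))

smooth = [0, 1, 2, 3, 4]    # steady ramp: four differences of 1
rough = [0, 4, 0, 4, 0]     # oscillation: four differences of +/- 4

energy_smooth = chain_dirichlet_energy(smooth)
energy_rough = chain_dirichlet_energy(rough)
```

Both series span the same range, but the oscillating one has a much larger energy (64 versus 4), which is the sense in which the function measures roughness.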


Convolution 


The convolution of an n-vector a and an m-vector b is the $(n+m-1)$-vector
denoted $c = a * b$, with entries

$$c_k = \sum_{i+j=k+1} a_i b_j, \quad k = 1, \ldots, n+m-1, \tag{7.2}$$

where the subscript in the sum means that we should sum over all values of i and
j in their index ranges 1,...,n and 1,...,m, for which the sum i + j is k + 1. For
example, with n = 4, m = 3, we have

$$\begin{aligned}
c_1 &= a_1 b_1 \\
c_2 &= a_1 b_2 + a_2 b_1 \\
c_3 &= a_1 b_3 + a_2 b_2 + a_3 b_1 \\
c_4 &= a_2 b_3 + a_3 b_2 + a_4 b_1 \\
c_5 &= a_3 b_3 + a_4 b_2 \\
c_6 &= a_4 b_3.
\end{aligned}$$


Convolution reduces to ordinary multiplication of numbers when n = m = 1, and 
to scalar-vector multiplication when either n = 1 or m = 1. Convolution arises in 
many applications and contexts. 

As a specific numerical example, we have $(1,0,-1) * (2,1,-1) = (2,1,-3,-1,1)$,
where the entries of the convolution result are found from

$$\begin{aligned}
2 &= (1)(2) \\
1 &= (1)(1) + (0)(2) \\
-3 &= (1)(-1) + (0)(1) + (-1)(2) \\
-1 &= (0)(-1) + (-1)(1) \\
1 &= (-1)(-1).
\end{aligned}$$
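
The definition (7.2) translates directly into a double loop. This is a minimal sketch in plain Python; the function name `convolve` is an illustrative choice, and the code uses 0-based indices, so the condition i + j = k + 1 becomes i + j = k.

```python
def convolve(a, b):
    """Convolution of lists a and b via formula (7.2), with 0-based indices."""
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)
    for i in range(n):
        for j in range(m):
            # a[i] * b[j] contributes to the entry whose index is i + j
            c[i + j] += a[i] * b[j]
    return c

print(convolve([1, 0, -1], [2, 1, -1]))  # [2, 1, -3, -1, 1]
```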


Polynomial multiplication. If a and b represent the coefficients of two polynomials

$$p(x) = a_1 + a_2 x + \cdots + a_n x^{n-1}, \qquad q(x) = b_1 + b_2 x + \cdots + b_m x^{m-1},$$

then the coefficients of the product polynomial p(x)q(x) are represented by $c = a * b$:

$$p(x)q(x) = c_1 + c_2 x + \cdots + c_{n+m-1} x^{n+m-2}.$$

To see this we will show that $c_k$ is the coefficient of $x^{k-1}$ in p(x)q(x). We expand the
product polynomial into mn terms, and collect those terms associated with $x^{k-1}$.
These terms have the form $a_i b_j x^{i+j-2}$, for i and j that satisfy $i + j - 2 = k - 1$,
i.e., $i + j = k + 1$. It follows that $c_k = \sum_{i+j=k+1} a_i b_j$, which agrees with the
convolution formula (7.2).
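
The polynomial interpretation can be spot-checked by evaluating both sides at a point. The sketch below is illustrative only: the helper names `convolve` and `polyval`, the two coefficient vectors, and the evaluation point are assumptions, not part of the text.

```python
def convolve(a, b):
    """Convolution via formula (7.2), with 0-based indices."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def polyval(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... by Horner's rule."""
    result = 0
    for coef in reversed(coeffs):
        result = result * x + coef
    return result

a = [1, 0, -1]   # p(x) = 1 - x^2
b = [2, 1, -1]   # q(x) = 2 + x - x^2
c = convolve(a, b)

x = 3
print(polyval(a, x) * polyval(b, x))  # 32
print(polyval(c, x))                  # 32
```

Agreement at one point does not prove the identity, but it is a quick sanity check on an implementation.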


Properties of convolution. Convolution is symmetric: We have $a * b = b * a$. It
is also associative: We have $(a * b) * c = a * (b * c)$, so we can write both as $a * b * c$.
Another property is that $a * b = 0$ implies that either a = 0 or b = 0. These
properties follow from the polynomial coefficient property above, and can also be
directly shown. As an example, let us show that $a * b = b * a$. Suppose p is the
polynomial with coefficients a, and q is the polynomial with coefficients b. The two
polynomials p(x)q(x) and q(x)p(x) are the same (since multiplication of numbers
is commutative), so they have the same coefficients. The coefficients of p(x)q(x)
are $a * b$ and the coefficients of q(x)p(x) are $b * a$. These must be the same.

A basic property is that for fixed a, the convolution $a * b$ is a linear function
of b; and for fixed b, it is a linear function of a. This means we can express $a * b$ as
a matrix-vector product:

$$a * b = T(b)a = T(a)b,$$

where T(b) is the $(n+m-1) \times n$ matrix with entries

$$T(b)_{ij} = \begin{cases} b_{i-j+1} & 1 \le i-j+1 \le m \\ 0 & \text{otherwise} \end{cases} \tag{7.3}$$

and similarly for T(a). For example, with n = 4 and m = 3, we have

$$T(b) = \begin{bmatrix}
b_1 & 0 & 0 & 0 \\
b_2 & b_1 & 0 & 0 \\
b_3 & b_2 & b_1 & 0 \\
0 & b_3 & b_2 & b_1 \\
0 & 0 & b_3 & b_2 \\
0 & 0 & 0 & b_3
\end{bmatrix}, \qquad
T(a) = \begin{bmatrix}
a_1 & 0 & 0 \\
a_2 & a_1 & 0 \\
a_3 & a_2 & a_1 \\
a_4 & a_3 & a_2 \\
0 & a_4 & a_3 \\
0 & 0 & a_4
\end{bmatrix}.$$


The matrices T(b) and T(a) are called Toeplitz matrices (named after the math- 
ematician Otto Toeplitz), which means the entries on any diagonal (i.e., indices 
with i — j constant) are the same. The columns of the Toeplitz matrix T(a) are 
simply shifted versions of the vector a, padded with zero entries. 
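
The Toeplitz form can be checked against the numerical example given earlier. This is a minimal sketch in plain Python; the helper names `toeplitz` and `matvec` are hypothetical, and matrices are stored as lists of rows with 0-based indices.

```python
def toeplitz(b, n):
    """The (n+m-1) x n matrix T(b): entry (i, j) is b[i-j] when 0 <= i-j < m, else 0."""
    m = len(b)
    return [[b[i - j] if 0 <= i - j < m else 0 for j in range(n)]
            for i in range(n + m - 1)]

def matvec(T, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(t * xj for t, xj in zip(row, x)) for row in T]

a = [1, 0, -1]
b = [2, 1, -1]
print(matvec(toeplitz(b, len(a)), a))  # T(b) a = [2, 1, -3, -1, 1]
print(matvec(toeplitz(a, len(b)), b))  # T(a) b = [2, 1, -3, -1, 1]
```

Both products reproduce the convolution $(1,0,-1) * (2,1,-1)$. Forming the matrix explicitly stores many zeros, so this is for illustration rather than efficiency.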


Variations. Several slightly different definitions of convolution are used in different 
applications. In one variation, a and b are infinite two-sided sequences (and not 
vectors) with indices ranging from $-\infty$ to $\infty$. In another variation, the rows of
T(a) at the top and bottom that do not contain all the coefficients of a are dropped. 
(In this version, the rows of T(a) are shifted versions of the vector a, reversed.) 
For consistency, we will use the one definition (7.2). 


Examples. 


• Time series smoothing. Suppose the n-vector x is a time series, and a =
(1/3, 1/3, 1/3). Then the (n + 2)-vector $y = a * x$ can be interpreted as a
smoothed version of the original time series: for i = 3,...,n, $y_i$ is the average
of $x_i$, $x_{i-1}$, $x_{i-2}$. The time series y is called the (3-period) moving average
of the time series x. Figure 7.6 shows an example.


• First order differences. If the n-vector x is a time series and $a = (1,-1)$, the
time series $y = a * x$ gives the first order differences in the series x:


$$y = (x_1, \; x_2 - x_1, \; x_3 - x_2, \; \ldots, \; x_n - x_{n-1}, \; -x_n).$$


(The first and last entries here would be the first order difference if we take
$x_0 = x_{n+1} = 0$.)


• Audio filtering. If the n-vector x is an audio signal, and a is a vector (typically
with length less than around 0.1 second of real time) the vector y = a * x 
is called the filtered audio signal, with filter coefficients a. Depending on 
the coefficients a, y will be perceived as enhancing or suppressing different 
frequencies, like the familiar audio tone controls. 


• Communication channel. In a modern data communication system, a time
series u is transmitted or sent over some channel (e.g., electrical, optical, or
radio) to a receiver, which receives the time series y. A very common model
is that y and u are related via convolution: $y = c * u$, where the vector c is
the channel impulse response.

Figure 7.6 Top. A time series represented by a vector x of length 100.
Bottom. The 3-period moving average of the time series as a vector of
length 102. This vector is the convolution of x with a = (1/3, 1/3, 1/3).

Input-output convolution system. Many physical systems with an input (time
series) m-vector u and output (time series) y are well modeled as $y = h * u$, where the
n-vector h is called the system impulse response. For example, $u_t$ might represent
the power level of a heater at time period t, and $y_t$ might represent the resulting
temperature rise (above the surrounding temperature). The lengths of u and y, m
and $m+n-1$, are typically large, and not relevant in these applications. We can
express the ith entry of the output y as

$$y_i = \sum_{j=1}^{n} u_{i-j+1} h_j,$$

where we interpret $u_k$ as zero for k < 1 or k > m. This formula states that y at
time i is a linear combination of $u_i, u_{i-1}, \ldots, u_{i-n+1}$, i.e., a linear combination
of the current input value $u_i$, and the past $n-1$ input values $u_{i-1}, \ldots, u_{i-n+1}$.
The coefficients are precisely $h_1, \ldots, h_n$. Thus, $h_3$ can be interpreted as the factor
by which the current output depends on what the input was, 2 time steps before.
Alternatively, we can say that $h_3$ is the factor by which the input at any time will
affect the output 2 steps in the future.
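
The moving average and first order difference examples can be viewed as input-output systems with impulse responses (1/3, 1/3, 1/3) and (1, −1). The sketch below computes both for a short series and compares one of them with the entrywise formula for $y_i$. The helper names, the sample series, and the use of `fractions.Fraction` for exact arithmetic are illustrative assumptions.

```python
from fractions import Fraction

def convolve(a, b):
    """Convolution via formula (7.2), with 0-based indices."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def output_entry(h, u, i):
    """y_i = sum over j of h_j * u_(i-j+1), 1-based, with out-of-range u taken as zero."""
    total = 0
    for j in range(1, len(h) + 1):
        k = i - j + 1
        if 1 <= k <= len(u):
            total += h[j - 1] * u[k - 1]
    return total

x = [3, 6, 9, 12]
third = Fraction(1, 3)

moving_avg = convolve([third, third, third], x)   # length n + 2
first_diff = convolve([1, -1], x)                 # length n + 1

print([float(v) for v in moving_avg])  # [1.0, 3.0, 6.0, 9.0, 7.0, 4.0]
print(first_diff)                      # [3, 3, 3, 3, -12]
print(output_entry([third] * 3, x, 3) == moving_avg[2])  # True
```

Only entries 3 and 4 of the moving average are true three-point averages here; the first two and last two entries average in the implicit zeros before and after the series.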


Complexity of convolution. The naive method to compute the convolution
$c = a * b$ of an n-vector a and an m-vector b, using the basic formula (7.2) to calculate
each $c_k$, requires around 2mn flops. The same number of flops is required to
compute the matrix-vector products T(a)b or T(b)a, taking into account the zeros 
at the top right and bottom left in the Toeplitz matrices T(b) and T(a). Forming 
these matrices requires us to store mn numbers, even though the original data 
contains only m + n numbers. 

It turns out that the convolution of two vectors can be computed far faster, using 
a so-called fast convolution algorithm. By exploiting the special structure of the 
convolution equations, this algorithm can compute the convolution of an n-vector 
and an m-vector in around $5(m+n)\log_2(m+n)$ flops, and with no additional
memory requirement beyond the original m + n numbers. The fast convolution 
algorithm is based on the fast Fourier transform (FFT), which is beyond the scope 
of this book. (The Fourier transform is named for the mathematician Jean-Baptiste 
Fourier.) 
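
For readers curious about the idea, here is a rough sketch of FFT-based convolution in plain Python. It is not the algorithm the flop count above refers to, and it makes several simplifying assumptions: a textbook recursive radix-2 FFT, zero-padding to a power of two (so it does use extra memory), and floating-point arithmetic, so the result matches the exact convolution only up to rounding error. The function names `fft` and `fast_convolve` are illustrative.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT of a list whose length is a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fast_convolve(a, b):
    """Convolution via zero-padding, FFT, pointwise product, and inverse FFT."""
    size = len(a) + len(b) - 1
    N = 1
    while N < size:
        N *= 2
    fa = fft(list(a) + [0] * (N - len(a)))
    fb = fft(list(b) + [0] * (N - len(b)))
    prod = [u * v for u, v in zip(fa, fb)]
    c = fft(prod, inverse=True)
    # the inverse transform above is unnormalized, so divide by N
    return [(v / N).real for v in c[:size]]

result = fast_convolve([1, 0, -1], [2, 1, -1])
print([round(v, 6) for v in result])  # close to (2, 1, -3, -1, 1)
```

The pointwise product of the two transforms corresponds to multiplying the polynomials p and q at N points on the unit circle, which ties this method back to the polynomial interpretation of convolution.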


2-D convolution 


Convolution has a natural extension to multiple dimensions. Suppose that A is an
$m \times n$ matrix and B is a $p \times q$ matrix. Their convolution is the $(m+p-1) \times (n+q-1)$
matrix

$$C_{rs} = \sum_{i+k=r+1,\; j+l=s+1} A_{ij} B_{kl}, \quad r = 1, \ldots, m+p-1, \quad s = 1, \ldots, n+q-1,$$

where the indices are restricted to their ranges (or alternatively, we assume that
$A_{ij}$ and $B_{kl}$ are zero, when the indices are out of range). This is not denoted
$C = A * B$, however, in standard mathematical notation. So we will use the
notation $C = A \star B$.

The same properties that we observed for 1-D convolution hold for 2-D convolution:
We have $A \star B = B \star A$, $(A \star B) \star C = A \star (B \star C)$, and for fixed B, $A \star B$
is a linear function of A.


Image blurring. If the $m \times n$ matrix X represents an image, $Y = X \star B$ represents
the effect of blurring the image by the point spread function (PSF) given by the
entries of the matrix B. If we represent X and Y as vectors, we have y = T(B)x,
for some $(m+p-1)(n+q-1) \times mn$ matrix T(B).


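As an example, with

$$B = \begin{bmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}, \tag{7.4}$$

$Y = X \star B$ is an image where each pixel value is the average of a 2 × 2 block of 4
adjacent pixels in X. The image Y would be perceived as the image X, with some
blurring of the fine details. This is illustrated in figure 7.7 for the 8 × 9 matrix

$$X = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix} \tag{7.5}$$

and its convolution with B,

$$X \star B = \begin{bmatrix}
1/4 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/4 \\
1/2 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1/2 \\
1/2 & 1 & 3/4 & 1/2 & 1/2 & 1/2 & 1/2 & 3/4 & 1 & 1/2 \\
1/2 & 1 & 3/4 & 1/4 & 1/4 & 1/2 & 1/4 & 1/2 & 1 & 1/2 \\
1/2 & 1 & 1 & 1/2 & 1/2 & 1 & 1/2 & 1/2 & 1 & 1/2 \\
1/2 & 1 & 1 & 1/2 & 1/2 & 1 & 1/2 & 1/2 & 1 & 1/2 \\
1/2 & 1 & 1 & 3/4 & 3/4 & 1 & 3/4 & 3/4 & 1 & 1/2 \\
1/2 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1/2 \\
1/4 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/2 & 1/4
\end{bmatrix}.$$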
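
The blurring example can be reproduced with a direct implementation of the 2-D convolution sum. This is a minimal sketch in plain Python with 0-based indices; the function name `convolve2d` is an illustrative choice, and X is the 8 × 9 matrix (7.5) entered as a list of rows.

```python
def convolve2d(A, B):
    """2-D convolution of an m x n matrix A with a p x q matrix B (lists of rows)."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n + q - 1) for _ in range(m + p - 1)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    C[i + k][j + l] += A[i][j] * B[k][l]
    return C

X = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
B = [[0.25, 0.25], [0.25, 0.25]]   # the point spread function (7.4)

Y = convolve2d(X, B)
print(len(Y), len(Y[0]))  # 9 10
print(Y[3])  # [0.5, 1.0, 0.75, 0.25, 0.25, 0.5, 0.25, 0.5, 1.0, 0.5]
```

The printed row is the fourth row of $X \star B$. The four nested loops make the cost proportional to mnpq, which is acceptable for small examples like this one.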


Figure 7.7 An 8 × 9 image and its convolution with the point spread function (7.4).

With the point spread function

$$D^{\mathrm{hor}} = \begin{bmatrix} 1 & -1 \end{bmatrix},$$

the pixel values in the image $Y = X \star D^{\mathrm{hor}}$ are the horizontal first order differences
of those in X:

$$Y_{ij} = X_{ij} - X_{i,j-1}, \quad i = 1, \ldots, m, \quad j = 2, \ldots, n$$

(and $Y_{i1} = X_{i1}$, $Y_{i,n+1} = -X_{in}$ for i = 1,...,m). With the point spread function

$$D^{\mathrm{ver}} = \begin{bmatrix} 1 \\ -1 \end{bmatrix},$$

the pixel values in the image $Y = X \star D^{\mathrm{ver}}$ are the vertical first order differences
of those in X:

$$Y_{ij} = X_{ij} - X_{i-1,j}, \quad i = 2, \ldots, m, \quad j = 1, \ldots, n$$

(and $Y_{1j} = X_{1j}$, $Y_{m+1,j} = -X_{mj}$ for j = 1,...,n). As an example, the convolutions
of the matrix (7.5) with $D^{\mathrm{hor}}$ and $D^{\mathrm{ver}}$ are


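$$X \star D^{\mathrm{hor}} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & -1 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}$$

and

$$X \star D^{\mathrm{ver}} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & -1 & -1 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1
\end{bmatrix}.$$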
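
The two difference images can be checked in the same way. The sketch below repeats a small 2-D convolution helper (the name `convolve2d` is an illustrative choice) and applies the horizontal and vertical difference point spread functions to the matrix (7.5).

```python
def convolve2d(A, B):
    """2-D convolution of matrices stored as lists of rows (0-based indices)."""
    m, n, p, q = len(A), len(A[0]), len(B), len(B[0])
    C = [[0] * (n + q - 1) for _ in range(m + p - 1)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    C[i + k][j + l] += A[i][j] * B[k][l]
    return C

ones = [1] * 9
X = [ones, ones,
     [1, 1, 0, 0, 0, 0, 0, 1, 1],
     [1, 1, 1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 0, 1, 1, 0, 1, 1],
     ones, ones]

D_hor = [[1, -1]]      # 1 x 2: differences along each row
D_ver = [[1], [-1]]    # 2 x 1: differences along each column

Y_hor = convolve2d(X, D_hor)   # 8 x 10
Y_ver = convolve2d(X, D_ver)   # 9 x 9
print(Y_hor[2])  # [1, 0, -1, 0, 0, 0, 0, 1, 0, -1]
print(Y_ver[3])  # [0, 0, 1, 0, 1, 1, 0, 0, 0]
```

Nonzero entries in the interior of the difference images mark the places where neighboring pixel values change, that is, the edges in the image.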


Figure 7.8 shows the effect of convolution on a larger image. The figure shows
an image of size 512 × 512 and its convolution with the 8 × 8 matrix B with constant
entries $B_{ij} = 1/64$.

Figure 7.8 512 × 512 image and the 519 × 519 image that results from the
convolution of the first image with an 8 × 8 matrix with constant entries
1/64. Image credit: NASA.

Exercises


Projection on a line. Let P(x) denote the projection of the 2-D point (2-vector) x onto
the line that passes through (0,0) and (1,3). (This means that P(x) is the point on the
line that is closest to x; see exercise 3.12.) Show that P is a linear function, and give the
matrix A for which P(x) = Ax for any x.


3-D rotation. Let x and y be 3-vectors representing positions in 3-D. Suppose that the
vector y is obtained by rotating the vector x about the vertical axis (i.e., $e_3$) by 45°
(counterclockwise, i.e., from $e_1$ toward $e_2$). Find the 3 × 3 matrix A for which y = Ax.
Hint. Determine the three columns of A by finding the result of the transformation on
the unit vectors $e_1$, $e_2$, $e_3$.


Trimming a vector. Find a matrix A for which $Ax = (x_2, \ldots, x_{n-1})$, where x is an
n-vector. (Be sure to specify the size of A, and describe all its entries.)


Down-sampling and up-conversion. We consider n-vectors x that represent signals, with
$x_k$ the value of the signal at time k for k = 1,...,n. Below we describe two functions of
x that produce new signals f(x). For each function, give a matrix A such that f(x) = Ax
for all x.


(a) 2× downsampling. We assume n is even and define f(x) as the n/2-vector y with
elements $y_k = x_{2k}$. To simplify your notation you can assume that n = 8, i.e.,

$$f(x) = (x_2, x_4, x_6, x_8).$$

(On page 131 we describe a different type of down-sampling, that uses the average
of pairs of original values.)


(b) 2× up-conversion with linear interpolation. We define f(x) as the $(2n-1)$-vector y
with elements $y_k = x_{(k+1)/2}$ if k is odd and $y_k = (x_{k/2} + x_{k/2+1})/2$ if k is even. To
simplify your notation you can assume that n = 5, i.e.,

$$f(x) = \left( x_1, \; \frac{x_1 + x_2}{2}, \; x_2, \; \frac{x_2 + x_3}{2}, \; x_3, \; \frac{x_3 + x_4}{2}, \; x_4, \; \frac{x_4 + x_5}{2}, \; x_5 \right).$$


Transpose of selector matrix. Suppose the m × n matrix A is a selector matrix. Describe
the relation between the m-vector u and the n-vector $v = A^Tu$.


Rows of incidence matrix. Show that the rows of the incidence matrix of a graph are 
always linearly dependent. Hint. Consider the sum of the rows. 


Incidence matrix of reversed graph. (See exercise 6.5.) Suppose A is the incidence matrix
of a graph. The reversed graph is obtained by reversing the directions of all the edges of 
the original graph. What is the incidence matrix of the reversed graph? (Express your 
answer in terms of A.) 


Flow conservation with sources. Suppose that A is the incidence matrix of a graph, x is
the vector of edge flows, and s is the external source vector, as described in §7.3. Assuming
that flow is conserved, i.e., Ax + s = 0, show that $\mathbf{1}^Ts = 0$. This means that the total
amount injected into the network by the sources ($s_i > 0$) must exactly balance the total
amount removed from the network at the sink nodes ($s_i < 0$). For example if the network
is a (lossless) electrical power grid, the total amount of electrical power generated (and
injected into the grid) must exactly balance the total electrical power consumed (from
the grid).


Social network graph. Consider a group of n people or users, and some symmetric social
relation among them. This means that some pairs of users are connected, or friends (say).
We can create a directed graph by associating a node with each user, and an edge between
each pair of friends, arbitrarily choosing the direction of the edge. Now consider an n-vector
v, where $v_i$ is some quantity for user i, for example, age or education level (say,
given in years). Let D(v) denote the Dirichlet energy associated with the graph and v,
thought of as a potential on the nodes.

Figure 7.9 Tree with six vertices.


(a) Explain why the number D(v) does not depend on the choice of directions for the
edges of the graph.


(b) Would you guess that D(v) is small or large? This is an open-ended, vague question;
there is no right answer. Just make a guess as to what you might expect, and give
a short English justification of your guess.


7.10 Circle graph. A circle graph (also called a cycle graph) has n vertices, with edges pointing 
from vertex 1 to vertex 2, from vertex 2 to vertex 3, ..., from vertex n — 1 to vertex n, 
and finally, from vertex n to vertex 1. (This last edge completes the circle.) 


(a) Draw a diagram of a circle graph, and give its incidence matrix A. 
(b) Suppose that x is a circulation for a circle graph. What can you say about x? 


(c) Suppose the n-vector v is a potential on a circle graph. What is the Dirichlet energy
$D(v) = \|A^Tv\|^2$?


Remark. The circle graph arises when an n-vector v represents a periodic time series. For
example, $v_1$ could be the value of some quantity on Monday, $v_2$ its value on Tuesday, and
$v_7$ its value on Sunday. The Dirichlet energy is a measure of the roughness of such an
n-vector v.


7.11 Tree. An undirected graph is called a tree if it is connected (there is a path from every
vertex to every other vertex) and contains no cycles, i.e., there is no path that begins and
ends at the same vertex. Figure 7.9 shows a tree with six vertices. For the tree in the
figure, find a numbering of the vertices and edges, and an orientation of the edges, so that
the incidence matrix A of the resulting directed graph satisfies $A_{ii} = 1$ for i = 1,...,5
and $A_{ij} = 0$ for i < j. In other words, the first 5 rows of A form a lower triangular matrix
with ones on the diagonal.


7.12 Some properties of convolution. Suppose that a is an n-vector. 


(a) Convolution with 1. What is $1 * a$? (Here we interpret 1 as a 1-vector.)


(b) Convolution with a unit vector. What is $e_k * a$, where $e_k$ is the kth unit vector of
dimension q? Describe this vector mathematically (i.e., give its entries), and via a
brief English description. You might find vector slice notation useful.


7.13 Sum property of convolution. Show that for any vectors a and b, we have
$\mathbf{1}^T(a * b) = (\mathbf{1}^Ta)(\mathbf{1}^Tb)$. In words: The sum of the coefficients of the convolution of two vectors is the
product of the sums of the coefficients of the vectors. Hint. If the vector a represents the
coefficients of a polynomial p, $\mathbf{1}^Ta = p(1)$.

Rainfall and river height. The T-vector r gives the daily rainfall in some region over a
period of T days. The vector h gives the daily height of a river in the region (above its
normal height). By careful modeling of water flow, or by fitting a model to past data, it
is found that these vectors are (approximately) related by convolution: $h = g * r$, where


g = (0.1, 0.4, 0.5, 0.2).


Give a short story in English (with no mathematical terms) to approximately describe 
this relation. For example, you might mention how many days after a one day heavy 
rainfall the river height is most affected. Or how many days it takes for the river height 
to return to the normal height once the rain stops. 


Channel equalization. We suppose that $u_1, \ldots, u_m$ is a signal (time series) that is transmitted
(for example by radio). A receiver receives the signal $y = c * u$, where the n-vector
c is called the channel impulse response. (See page 138.) In most applications n is small,
e.g., under 10, and m is much larger. An equalizer is a k-vector h that satisfies $h * c \approx e_1$,
the first unit vector of length $n + k - 1$. The receiver equalizes the received signal y by
convolving it with the equalizer to obtain $z = h * y$.


(a) How are z (the equalized received signal) and u (the original transmitted signal)
related? Hint. Recall that $h * (c * u) = (h * c) * u$.


(b) Numerical example. Generate a signal u of length m = 50, with each entry a random
value that is either $-1$ or $+1$. Plot u and $y = c * u$, with $c = (1, 0.7, -0.3)$. Also
plot the equalized signal $z = h * y$, with


$$h = (0.9, -0.5, 0.5, -0.4, 0.3, -0.3, 0.2, -0.1).$$

Chapter 8


Linear equations 


In this chapter we consider vector-valued linear and affine functions, and systems 
of linear equations. 


Linear and affine functions 


Vector-valued functions of vectors. The notation $f : \mathbf{R}^n \to \mathbf{R}^m$ means that f
is a function that maps real n-vectors to real m-vectors. The value of the function
f, evaluated at an n-vector x, is an m-vector $f(x) = (f_1(x), f_2(x), \ldots, f_m(x))$.
Each of the components $f_i$ of f is itself a scalar-valued function of x. As with
scalar-valued functions, we sometimes write $f(x) = f(x_1, x_2, \ldots, x_n)$ to emphasize
that f is a function of n scalar arguments. We use the same notation for each of
the components of f, writing $f_i(x) = f_i(x_1, x_2, \ldots, x_n)$ to emphasize that $f_i$ is a
function mapping the scalar arguments $x_1, \ldots, x_n$ into a scalar.


The matrix-vector product function. Suppose A is an m × n matrix. We can
define a function $f : \mathbf{R}^n \to \mathbf{R}^m$ by f(x) = Ax. The inner product function
$f : \mathbf{R}^n \to \mathbf{R}$, defined as $f(x) = a^Tx$, discussed in §2.1, is the special case with
m = 1.


Superposition and linearity. The function $f : \mathbf{R}^n \to \mathbf{R}^m$, defined by f(x) = Ax,
is linear, i.e., it satisfies the superposition property:


$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y) \tag{8.1}$$


holds for all n-vectors x and y and all scalars α and β. It is a good exercise to
parse this simple looking equation, since it involves overloading of notation. On
the left-hand side, the scalar-vector multiplications αx and βy involve n-vectors,
and the sum αx + βy is the sum of two n-vectors. The function f maps n-vectors to
m-vectors, so f(αx + βy) is an m-vector. On the right-hand side, the scalar-vector
multiplications and the sum are those for m-vectors. Finally, the equality sign is
equality between two m-vectors.

We can verify that superposition holds for f using properties of matrix-vector
and scalar-vector multiplication:


$$\begin{aligned}
f(\alpha x + \beta y) &= A(\alpha x + \beta y) \\
&= A(\alpha x) + A(\beta y) \\
&= \alpha (Ax) + \beta (Ay) \\
&= \alpha f(x) + \beta f(y).
\end{aligned}$$


Thus we can associate with every matrix A a linear function f(x) = Ax.

The converse is also true. Suppose f is a function that maps n-vectors to m-vectors,
and is linear, i.e., (8.1) holds for all n-vectors x and y and all scalars α
and β. Then there exists an m × n matrix A such that f(x) = Ax for all x. This
can be shown in the same way as for scalar-valued functions in §2.1, by showing
that if f is linear, then


$$f(x) = x_1 f(e_1) + x_2 f(e_2) + \cdots + x_n f(e_n), \tag{8.2}$$


where $e_k$ is the kth unit vector of size n. The right-hand side can also be written
as a matrix-vector product Ax, with


$$A = \begin{bmatrix} f(e_1) & f(e_2) & \cdots & f(e_n) \end{bmatrix}.$$


The expression (8.2) is the same as (2.3), but here f(x) and $f(e_k)$ are vectors. The
implications are exactly the same: A linear vector-valued function f is completely
characterized by evaluating f at the n unit vectors $e_1, \ldots, e_n$.

As in §2.1 it is easily shown that the matrix-vector representation of a linear
function is unique. If $f : \mathbf{R}^n \to \mathbf{R}^m$ is a linear function, then there exists exactly
one matrix A such that f(x) = Ax for all x.


Examples of linear functions. In the examples below we define functions f that 
map n-vectors x to n-vectors f(x). Each function is described in words, in terms of 
its effect on an arbitrary x. In each case we give the associated matrix multiplication 
representation. 
• Negation. f changes the sign of x: $f(x) = -x$.
Negation can be expressed as f(x) = Ax with $A = -I$.


• Reversal. f reverses the order of the elements of x: $f(x) = (x_n, x_{n-1}, \ldots, x_1)$.


The reversal function can be expressed as f(x) = Ax with


$$A = \begin{bmatrix}
0 & \cdots & 0 & 1 \\
0 & \cdots & 1 & 0 \\
\vdots & & \vdots & \vdots \\
1 & \cdots & 0 & 0
\end{bmatrix}.$$


(This is the n × n identity matrix with the order of its columns reversed. It
is the reverser matrix introduced in §7.2.)

• Running sum. f forms the running sum of the elements in x:


$$f(x) = (x_1, \; x_1 + x_2, \; x_1 + x_2 + x_3, \; \ldots, \; x_1 + x_2 + \cdots + x_n).$$


The running sum function can be expressed as f(x) = Ax with


$$A = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & 1 & \cdots & 1 & 0 \\
1 & 1 & \cdots & 1 & 1
\end{bmatrix},$$


i.e., $A_{ij} = 1$ if $i \ge j$ and $A_{ij} = 0$ otherwise. This is the running sum matrix
defined in (6.6).


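• De-meaning. f subtracts the mean from each entry of a vector x:
$f(x) = x - \mathbf{avg}(x)\mathbf{1}$.


The de-meaning function can be expressed as f(x) = Ax with


$$A = \begin{bmatrix}
1 - 1/n & -1/n & \cdots & -1/n \\
-1/n & 1 - 1/n & \cdots & -1/n \\
\vdots & \vdots & \ddots & \vdots \\
-1/n & -1/n & \cdots & 1 - 1/n
\end{bmatrix}.$$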

Examples of functions that are not linear. Here we list some examples of functions
f that map n-vectors x to n-vectors f(x) that are not linear. In each case
we show a superposition counterexample.


• Absolute value. f replaces each element of x with its absolute value:
$f(x) = (|x_1|, |x_2|, \ldots, |x_n|)$.
The absolute value function is not linear. For example, with n = 1, x = 1,
y = 0, $\alpha = -1$, $\beta = 0$, we have


$$f(\alpha x + \beta y) = 1 \neq \alpha f(x) + \beta f(y) = -1,$$


so superposition does not hold.


• Sort. f sorts the elements of x in decreasing order.


The sort function is not linear (except when n = 1, in which case f(x) = x).
For example, if n = 2, x = (1,0), y = (0,1), α = β = 1, then


$$f(\alpha x + \beta y) = (1,1) \neq \alpha f(x) + \beta f(y) = (2,0).$$
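
The two counterexamples above can be reproduced with a small superposition test. This is an illustrative sketch; the helper names `comb` and `superposition_holds` are hypothetical.

```python
def comb(alpha, x, beta, y):
    """The linear combination alpha*x + beta*y of two lists."""
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

def superposition_holds(f, x, y, alpha, beta):
    """Check f(alpha*x + beta*y) == alpha*f(x) + beta*f(y) for one choice of inputs."""
    return f(comb(alpha, x, beta, y)) == comb(alpha, f(x), beta, f(y))

def absval(x):
    return [abs(xi) for xi in x]

def sort_desc(x):
    return sorted(x, reverse=True)

def reverse(x):
    return x[::-1]

print(superposition_holds(absval, [1], [0], -1, 0))          # False
print(superposition_holds(sort_desc, [1, 0], [0, 1], 1, 1))  # False
print(superposition_holds(reverse, [1, 0], [0, 1], 1, 1))    # True
```

A single failing case is enough to show that a function is not linear. A passing case, as for reversal here, proves nothing by itself; linearity requires the property to hold for all inputs.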


Affine functions. A vector-valued function $f : \mathbf{R}^n \to \mathbf{R}^m$ is called affine if it can
be expressed as f(x) = Ax + b, where A is an m × n matrix and b is an m-vector.
It can be shown that a function $f : \mathbf{R}^n \to \mathbf{R}^m$ is affine if and only if


$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$$

holds for all n-vectors x, y, and all scalars α, β that satisfy α + β = 1. In other
words, superposition holds for affine combinations of vectors. (For linear functions,
superposition holds for any linear combinations of vectors.)


The matrix A and the vector b in the representation of an affine function as
f(x) = Ax + b are unique. These parameters can be obtained by evaluating f at
the vectors $0, e_1, \ldots, e_n$, where $e_k$ is the kth unit vector in $\mathbf{R}^n$. We have


$$A = \begin{bmatrix} f(e_1) - f(0) & f(e_2) - f(0) & \cdots & f(e_n) - f(0) \end{bmatrix}, \qquad b = f(0).$$


Just like affine scalar-valued functions, affine vector-valued functions are often
called linear, even though they are linear only when the vector b is zero.
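
The recipe for A and b can be tried on a small made-up affine function. In the sketch below, the function `f` and the helper name `affine_parameters` are hypothetical examples, not taken from the text.

```python
def affine_parameters(f, n):
    """Recover (A, b) with f(x) = A x + b by evaluating f at 0, e_1, ..., e_n."""
    b = f([0] * n)
    cols = []
    for k in range(n):
        e = [0] * n
        e[k] = 1
        fe = f(e)
        cols.append([fe[i] - b[i] for i in range(len(b))])  # f(e_k) - f(0)
    A = [[cols[j][i] for j in range(n)] for i in range(len(b))]
    return A, b

def f(x):
    # hypothetical affine function from 2-vectors to 3-vectors
    return [2 * x[0] - x[1] + 1, x[0] + 3 * x[1] - 4, x[1]]

A, b = affine_parameters(f, 2)
print(A)  # [[2, -1], [1, 3], [0, 1]]
print(b)  # [1, -4, 0]
```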


Linear function models 


Many functions or relations between variables that arise in natural science, engi- 
neering, and social sciences can be approximated as linear or affine functions. In 
these cases we refer to the linear function relating the two sets of variables as a 
model or an approximation, to remind us that the relation is only an approximation, 
and not exact. We give a few examples here. 


• Price elasticity of demand. Consider n goods or services with prices given by
the n-vector p, and demands for the goods given by the n-vector d. A change
in prices will induce a change in demands. We let $\delta^{\mathrm{price}}$ be the n-vector
that gives the fractional change in the prices, i.e., $\delta_i^{\mathrm{price}} = (p_i^{\mathrm{new}} - p_i)/p_i$,
where $p^{\mathrm{new}}$ is the n-vector of new (changed) prices. We let $\delta^{\mathrm{dem}}$ be the
n-vector that gives the fractional change in the product demands, i.e.,
$\delta_i^{\mathrm{dem}} = (d_i^{\mathrm{new}} - d_i)/d_i$, where $d^{\mathrm{new}}$ is the n-vector of new demands. A linear demand
elasticity model relates these vectors as $\delta^{\mathrm{dem}} = E^{\mathrm{d}} \delta^{\mathrm{price}}$, where $E^{\mathrm{d}}$ is the n × n
demand elasticity matrix. For example, suppose $E^{\mathrm{d}}_{11} = -0.4$ and $E^{\mathrm{d}}_{21} = 0.2$.
This means that a 1% increase in the price of the first good, with other
prices kept the same, will cause demand for the first good to drop by 0.4%,
and demand for the second good to increase by 0.2%. (In this example, the
second good is acting as a partial substitute for the first good.)


• Elastic deformation. Consider a steel structure like a bridge or the structural
frame of a building. Let f be an n-vector that gives the forces applied to 
the structure at n specific places (and in n specific directions), sometimes 
called a loading. The structure will deform slightly due to the loading. Let 
d be an m-vector that gives the displacements (in specific directions) of m 
points in the structure, due to the load, e.g., the amount of sag at a specific 
point on a bridge. For small displacements, the relation between displacement 
and loading is well approximated as linear: d = Cf, where C is the m x n 
compliance matrix. The units of the entries of C are m/N. 
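
As a small numerical illustration of the price elasticity model, the sketch below applies a 2 × 2 elasticity matrix to a 1% increase in the first price. The first column uses the values −0.4 and 0.2 from the example above; the second column is made up for illustration, and the helper name `matvec` is hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# First column from the text; second column is an assumed placeholder.
E_d = [[-0.4, 0.1],
       [0.2, -0.3]]

delta_price = [0.01, 0.0]          # 1% increase in the first price only
delta_dem = matvec(E_d, delta_price)
print(delta_dem)  # roughly [-0.004, 0.002]: a 0.4% drop and a 0.2% rise
```

Because only the first price changes, the result depends only on the first column of the matrix, consistent with the interpretation given in the text.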


Taylor approximation


Suppose $f : \mathbf{R}^n \to \mathbf{R}^m$ is differentiable, i.e., has partial derivatives, and z is an
n-vector. The first-order Taylor approximation of f near z is given by


$$\begin{aligned}
\hat f(x)_i &= f_i(z) + \frac{\partial f_i}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial f_i}{\partial x_n}(z)(x_n - z_n) \\
&= f_i(z) + \nabla f_i(z)^T (x - z),
\end{aligned}$$


for i = 1,...,m. (This is just the first-order Taylor approximation of each of the
scalar-valued functions $f_i$, described in §2.2.) For x near z, $\hat f(x)$ is a very good
approximation of f(x). We can express this approximation in compact notation,
using matrix-vector multiplication, as


$$\hat f(x) = f(z) + Df(z)(x - z), \tag{8.3}$$


where the m × n matrix Df(z) is the derivative or Jacobian matrix of f at z
(see §C.1). Its components are the partial derivatives of f,


$$Df(z)_{ij} = \frac{\partial f_i}{\partial x_j}(z), \quad i = 1, \ldots, m, \quad j = 1, \ldots, n,$$

evaluated at the point z. The rows of the Jacobian are $\nabla f_i(z)^T$, for i = 1,...,m.


The Jacobian matrix is named for the mathematician Carl Gustav Jacob Jacobi. 

As in the scalar-valued case, Taylor approximation is sometimes written with a
second argument as $\hat f(x; z)$ to show the point z around which the approximation is
made. Evidently the Taylor series approximation $\hat f$ is an affine function of x. (It is
often called a linear approximation of f, even though it is not, in general, a linear
function.)
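
To see how the approximation (8.3) behaves near z, the sketch below uses a made-up function from 2-vectors to 2-vectors with a hand-coded Jacobian. The function `f`, its Jacobian, the point z, and the step sizes are all illustrative assumptions.

```python
import math

def f(x):
    # hypothetical example: f(x) = (x1 * x2, sin(x1) + x2^2)
    return [x[0] * x[1], math.sin(x[0]) + x[1] ** 2]

def jacobian(x):
    # partial derivatives of f, worked out by hand for this example
    return [[x[1], x[0]],
            [math.cos(x[0]), 2 * x[1]]]

def taylor(x, z):
    """First-order Taylor approximation f(z) + Df(z)(x - z)."""
    fz, J = f(z), jacobian(z)
    return [fz[i] + sum(J[i][j] * (x[j] - z[j]) for j in range(len(z)))
            for i in range(len(fz))]

z = [1.0, 2.0]
errors = {}
for step in (0.1, 0.01):
    x = [z[0] + step, z[1] - step]
    errors[step] = max(abs(u - v) for u, v in zip(f(x), taylor(x, z)))
    print(step, errors[step])
```

For this example, reducing the step by a factor of 10 reduces the error by roughly a factor of 100, which is the behavior expected of a first-order approximation of a smooth function.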


Regression model 


Recall the regression model (2.7)

$$\hat y = x^T\beta + v, \tag{8.4}$$


where the n-vector x is a feature vector for some object, β is an n-vector of weights,
v is a constant (the offset), and $\hat y$ is the (scalar) value of the regression model
prediction.

Now suppose we have a set of N objects (also called samples or examples), with
feature vectors $x^{(1)}, \ldots, x^{(N)}$. The regression model predictions associated with the
examples are given by


$$\hat y^{(i)} = (x^{(i)})^T\beta + v, \quad i = 1, \ldots, N.$$


These numbers usually correspond to predictions of the value of the outputs or
responses. If in addition to the example feature vectors $x^{(i)}$ we are also given the
actual value of the associated response variables, $y^{(1)}, \ldots, y^{(N)}$, then our prediction
errors or residuals are


$$r^{(i)} = y^{(i)} - \hat y^{(i)}, \quad i = 1, \ldots, N.$$

(Some authors define the prediction errors as $\hat y^{(i)} - y^{(i)}$.)
We can express this using compact matrix-vector notation. We form the n × N
feature matrix X with columns $x^{(1)}, \ldots, x^{(N)}$. We let $y^{\mathrm{d}}$ denote the N-vector whose
entries are the actual values of the response for the N examples. (The superscript
'd' stands for 'data'.) We let $\hat y^{\mathrm{d}}$ denote the N-vector of regression model predictions
for the N examples, and we let $r^{\mathrm{d}}$ denote the N-vector of residuals or prediction
errors. We can then express the regression model predictions for this data set in
matrix-vector form as


$$\hat y^{\mathrm{d}} = X^T\beta + v\mathbf{1}.$$


The vector of N prediction errors for the examples is given by


$$r^{\mathrm{d}} = y^{\mathrm{d}} - \hat y^{\mathrm{d}} = y^{\mathrm{d}} - X^T\beta - v\mathbf{1}.$$


We can include the offset v in the regression model by including an additional
feature equal to one as the first entry of each feature vector:


$$\hat y^{\mathrm{d}} = \begin{bmatrix} \mathbf{1}^T \\ X \end{bmatrix}^T \begin{bmatrix} v \\ \beta \end{bmatrix} = \tilde X^T \tilde\beta,$$

where $\tilde X$ is the new feature matrix, with a new first row of ones, and $\tilde\beta = (v, \beta)$ is
the vector of regression model parameters. This is often written without the tildes,
as $\hat y^{\mathrm{d}} = X^T\beta$, by simply including the feature one as the first feature.

The equation above shows that the N-vector of predictions for the N examples
is a linear function of the model parameters (v, β). The N-vector of prediction
errors is an affine function of the model parameters.
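
The two ways of writing the predictions can be compared on a tiny made-up data set. In the sketch below, the data, the parameter values, and the helper name `predictions` are all illustrative assumptions; X is stored as a list of n rows, so its columns are the feature vectors.

```python
def predictions(X, beta, v):
    """Compute X^T beta + v*1, where X is n x N with feature vectors as columns."""
    n, N = len(X), len(X[0])
    return [sum(X[j][i] * beta[j] for j in range(n)) + v for i in range(N)]

X = [[1, 2, 3],
     [0, 1, 4]]          # n = 2 features, N = 3 examples
beta = [2, -1]
v = 5
y = [7, 9, 6]            # made-up responses

y_hat = predictions(X, beta, v)
r = [yi - yh for yi, yh in zip(y, y_hat)]
print(y_hat)  # [7, 8, 7]
print(r)      # [0, 1, -1]

# Same predictions with the offset folded in as a constant first feature.
X_tilde = [[1, 1, 1]] + X
beta_tilde = [v] + beta
print(predictions(X_tilde, beta_tilde, 0))  # [7, 8, 7]
```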


Systems of linear equations 


Consider a set (also called a system) of m linear equations in n variables or unknowns
$x_1, \ldots, x_n$:


$$\begin{aligned}
A_{11}x_1 + A_{12}x_2 + \cdots + A_{1n}x_n &= b_1 \\
A_{21}x_1 + A_{22}x_2 + \cdots + A_{2n}x_n &= b_2 \\
&\;\;\vdots \\
A_{m1}x_1 + A_{m2}x_2 + \cdots + A_{mn}x_n &= b_m.
\end{aligned}$$


The numbers $A_{ij}$ are called the coefficients in the linear equations, and the numbers
$b_i$ are called the right-hand sides (since by tradition, they appear on the right-hand
side of the equation). These equations can be written succinctly in matrix notation
as

$$Ax = b. \tag{8.5}$$

In this context, the m × n matrix A is called the coefficient matrix, and the m-vector
b is called the right-hand side. An n-vector x is called a solution of the
linear equations if Ax = b holds. A set of linear equations can have no solutions,
one solution, or multiple solutions.


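Examples.

• The set of linear equations

$$x_1 + x_2 = 1, \quad x_1 = -1, \quad x_1 - x_2 = 0$$

is written as Ax = b with

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 1 & -1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.$$

It has no solutions.

• The set of linear equations

$$x_1 + x_2 = 1, \quad x_2 + x_3 = 2$$

is written as Ax = b with

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$

It has multiple solutions, including x = (1,0,2) and x = (0,1,1).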
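
Whether a given vector solves Ax = b is easy to check directly. The sketch below does this for the two examples above; the helper names `matvec` and `is_solution` are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def is_solution(A, b, x):
    return matvec(A, x) == b

# Second example: two equations, three unknowns.
A2 = [[1, 1, 0],
      [0, 1, 1]]
b2 = [1, 2]
print(is_solution(A2, b2, [1, 0, 2]))  # True
print(is_solution(A2, b2, [0, 1, 1]))  # True
print(is_solution(A2, b2, [0, 0, 0]))  # False

# First example: the last two equations force x = (-1, -1),
# which violates the first equation, so no solution exists.
A1 = [[1, 1],
      [1, 0],
      [1, -1]]
b1 = [1, -1, 0]
print(matvec(A1, [-1, -1]))  # [-2, -1, 0], not equal to b1
```

Checking a candidate is a single matrix-vector product; deciding whether any solution exists is the harder question taken up later in the book.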


Over-determined and under-determined systems of linear equations. The set 
of linear equations is called over-determined if m > n, under-determined if m < n, 
and square if m = n; these correspond to the coefficient matrix being tall, wide, 
and square, respectively. When the system of linear equations is over-determined, 
there are more equations than variables or unknowns. When the system of linear 
equations is under-determined, there are more unknowns than equations. When 
the system of linear equations is square, the numbers of unknowns and equations
are the same. A set of equations with zero right-hand side, Ax = 0, is called a
homogeneous set of equations. Any homogeneous set of equations has x = 0 as a 
solution. 

In chapter 11 we will address the question of how to determine if a system of 
linear equations has a solution, and how to find one when it does. For now, we 
give a few interesting examples.

Examples


Coefficients of linear combinations. Let $a_1, \ldots, a_n$ denote the columns of A. The
system of linear equations Ax = b can be expressed as

$$x_1 a_1 + \cdots + x_n a_n = b,$$

i.e., b is a linear combination of $a_1, \ldots, a_n$, with coefficients $x_1, \ldots, x_n$. So solving
Ax = b is the same as finding coefficients that express b as a linear combination of
the vectors $a_1, \ldots, a_n$.


Polynomial interpolation. We seek a polynomial p of degree at most $n - 1$ that
interpolates a set of m given points $(t_i, y_i)$, $i = 1, \ldots, m$. (This means that
$p(t_i) = y_i$.) We can express this as a set of m linear equations in the n unknowns c, where
c is the n-vector of coefficients: Ac = y. Here the matrix A is the Vandermonde
matrix (6.7), and the vector c is the vector of polynomial coefficients, as described
in the example on page 120.
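
As an illustration, the interpolation equations Ac = y can be formed and solved in a few lines of Python with NumPy. This is a sketch under assumptions of our own: the data points are made up (samples of $t^2$), we take m = n = 4 so the system is square, and the helper name `vandermonde` is hypothetical.

```python
import numpy as np


def vandermonde(t, n):
    """m x n Vandermonde matrix with rows (1, t_i, t_i^2, ..., t_i^(n-1))."""
    t = np.asarray(t, dtype=float)
    return np.vander(t, N=n, increasing=True)


# Made-up data: four samples of the polynomial p(t) = t^2.
t = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([1.0, 0.0, 1.0, 4.0])

A = vandermonde(t, 4)          # square case: m = n = 4
c = np.linalg.solve(A, y)      # coefficients c_1, ..., c_4

print(c)                       # coefficients of 1, t, t^2, t^3
print(np.allclose(A @ c, y))   # the interpolation conditions hold
```

With distinct points and m = n, the square Vandermonde system has exactly one solution; for m different from n the system is over- or under-determined and `np.linalg.solve` does not apply.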


Balancing chemical reactions. A chemical reaction involves p reactants (molecules)
and q products, and can be written as

$$a_1 R_1 + \cdots + a_p R_p \longrightarrow b_1 P_1 + \cdots + b_q P_q.$$

Here $R_1, \ldots, R_p$ are the reactants, $P_1, \ldots, P_q$ are the products, and the numbers
$a_1, \ldots, a_p$ and $b_1, \ldots, b_q$ are positive numbers that tell us how many of each of
these molecules is involved in the reaction. They are typically integers, but can be
scaled arbitrarily; we could double all of these numbers, for example, and we still
have the same reaction. As a simple example, we have the electrolysis of water,

$$2\mathrm{H_2O} \longrightarrow 2\mathrm{H_2} + \mathrm{O_2},$$

which has one reactant, water ($\mathrm{H_2O}$), and two products, molecular hydrogen ($\mathrm{H_2}$)
and molecular oxygen ($\mathrm{O_2}$). The coefficients tell us that 2 water molecules create
2 hydrogen molecules and 1 oxygen molecule. The coefficients in a reaction can
be multiplied by any nonzero number; for example, we could write the reaction
above as $3\mathrm{H_2O} \longrightarrow 3\mathrm{H_2} + (3/2)\mathrm{O_2}$. By convention reactions are written with all
coefficients integers, with greatest common divisor one.

In a chemical reaction the numbers of constituent atoms must balance. This 
means that for each atom appearing in any of the reactants or products, the total 
amount on the left-hand side must equal the total amount on the right-hand side. 
(If any of the reactants or products is charged, i.e., an ion, then the total charge 
must also balance.) In the simple water electrolysis reaction above, for example, 
we have 4 hydrogen atoms on the left (2 water molecules, each with 2 hydrogen 
atoms), and 4 on the right (2 hydrogen molecules, each with 2 hydrogen atoms). 
The oxygen atoms also balance, so this reaction is balanced. 

Balancing a chemical reaction with specified reactants and products, i.e., finding
the numbers $a_1, \ldots, a_p$ and $b_1, \ldots, b_q$, can be expressed as a system of linear
equations. We can express the requirement that the reaction balances as a set of
m equations, where m is the number of different atoms appearing in the chemical
reaction. We define the m × p matrix R by

$$R_{ij} = \text{number of atoms of type } i \text{ in } R_j, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, p.$$

(The entries of R are nonnegative integers.) The matrix R is interesting; for example,
its jth column gives the chemical formula for reactant $R_j$. We let a denote
the p-vector with entries $a_1, \ldots, a_p$. Then, the m-vector Ra gives the total number
of atoms of each type appearing in the reactants. We define an m × q matrix P
in a similar way, so the m-vector Pb gives the total number of atoms of each type
that appears in the products.

We write the balance condition using vectors and matrices as Ra = Pb. We
can express this as

$$\begin{bmatrix} R & -P \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = 0,$$

which is a set of m homogeneous linear equations.

A simple solution of these equations is a = 0, b = 0. But we seek a nonzero
solution. We can set one of the coefficients, say $a_1$, to be one. (This might cause
the other quantities to be fractional-valued.) We can add the condition that $a_1 = 1$
to our system of linear equations as

$$\begin{bmatrix} R & -P \\ e_1^T & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = e_{m+1}.$$

Finally, we have a set of m + 1 equations in p + q variables that expresses the
requirement that the chemical reaction balances. Finding a solution of this set of
equations is called balancing the chemical reaction.

For the example of electrolysis of water described above, we have p = 1 reactant
(water) and q = 2 products (molecular hydrogen and oxygen). The reaction
involves m = 2 atoms, hydrogen and oxygen. The reactant and product matrices
are

$$R = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad P = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}.$$

The balancing equations are then

$$\begin{bmatrix} 2 & -2 & 0 \\ 1 & 0 & -2 \\ 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} a_1 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$

These equations are easily solved, and have the solution (1, 1, 1/2). (Multiplying
these coefficients by 2 gives the reaction given above.)
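
The water electrolysis example can be reproduced numerically. The sketch below, in Python with NumPy, stacks the balance equations and the normalization $a_1 = 1$ and solves the resulting square system. The variable names and the final step that rescales to integers (via `fractions.Fraction`) are our own illustrative choices, not a method prescribed by the text.

```python
from fractions import Fraction
from math import gcd

import numpy as np

# Rows are atom types (H, O); columns are molecules.
R = np.array([[2.0],          # H2O contains 2 H
              [1.0]])         # H2O contains 1 O
P = np.array([[2.0, 0.0],     # H2 contains 2 H, O2 contains 0 H
              [0.0, 2.0]])    # H2 contains 0 O, O2 contains 2 O

m, p = R.shape
q = P.shape[1]

# Stack [R  -P] on top of [e_1^T  0]; the right-hand side is e_{m+1}.
e1 = np.zeros(p + q)
e1[0] = 1.0
M = np.vstack([np.hstack([R, -P]), e1])
rhs = np.zeros(m + 1)
rhs[m] = 1.0

coeffs = np.linalg.solve(M, rhs)   # (a_1, b_1, b_2)

# Rescale to integers: multiply by the least common multiple of the denominators.
fracs = [Fraction(float(v)).limit_denominator(1000) for v in coeffs]
scale = 1
for fr in fracs:
    scale = scale * fr.denominator // gcd(scale, fr.denominator)
integers = [int(fr * scale) for fr in fracs]

print(coeffs)     # fractional solution with a_1 = 1
print(integers)   # integer coefficients of the balanced reaction
```

Here the stacked matrix happens to be square and invertible (m + 1 = p + q = 3), so `np.linalg.solve` applies directly; in general the system of m + 1 equations in p + q variables need not be square.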


Diffusion systems. A diffusion system is a common model that arises in many 
areas of physics to describe flows and potentials. We start with a directed graph 
with n nodes and m edges. (See §6.1.) Some quantity (like electricity, heat, energy, 
or mass) can flow across the edges, from one node to another. 

With edge j we associate a flow (rate) $f_j$, which is a scalar; the vector of all
m flows is the flow m-vector f. The flows $f_j$ can be positive or negative: Positive
$f_j$ means the quantity flows in the direction of edge j, and negative $f_j$ means
the quantity flows in the opposite direction of edge j. The flows can represent, for
example, heat flow (in units of Watts) in a thermal model, electrical current (Amps)
in an electrical circuit, or movement (diffusion) of mass (such as, for example, a
pollutant). We also have a source (or exogenous) flow $s_i$ at each node, with $s_i > 0$
meaning that an exogenous flow is injected into node i, and $s_i < 0$ meaning that
an exogenous flow is removed from node i. (In some contexts, a node where flow
is removed is called a sink.) In a thermal system, the sources represent thermal
(heat) sources; in an electrical circuit, they represent electrical current sources; in
a system with diffusion, they represent external injection or removal of the mass.

Figure 8.1 A node in a diffusion system with label 1, exogenous flow $s_1$ and
three incident edges.

In a diffusion system, the flows must satisfy (flow) conservation, which means
that at each node, the total flow entering the node from adjacent edges and the
exogenous source must be zero. This is illustrated in figure 8.1, which shows three
edges adjacent to node 1, two entering node 1 (flows 1 and 2), and one (flow 3)
leaving node 1, and an exogenous flow. Flow conservation at this node is expressed
as

$$f_1 + f_2 - f_3 + s_1 = 0.$$

Flow conservation at every node can be expressed by the simple matrix-vector
equation

$$Af + s = 0, \tag{8.6}$$

where A is the incidence matrix described in §7.3. (This is called Kirchhoff’s
current law in an electrical circuit, after the physicist Gustav Kirchhoff; when the
flows represent movement of mass, it is called conservation of mass.)

With node i we associate a potential $e_i$; the n-vector e gives the potential
at all nodes. (Note that here, e represents the n-vector of potentials; $e_i$ is the
scalar potential at node i, and not the standard ith unit vector.) The potential
might represent the node temperature in a thermal model, the electrical potential
(voltage) in an electrical circuit, and the concentration in a system that involves
mass diffusion.

In a diffusion system the flow on an edge is proportional to the potential difference
across its adjacent nodes. This is typically written as $r_j f_j = e_k - e_l$, where
edge j goes from node k to node l, and $r_j$ (which is typically positive) is called the
resistance of edge j. In a thermal model, $r_j$ is called the thermal resistance of the
edge; in an electrical circuit, it is called the electrical resistance. This is illustrated
in figure 8.2, which shows edge 8, connecting node 2 and node 3, corresponding to
an edge flow equation

$$r_8 f_8 = e_2 - e_3.$$

Figure 8.2 The flow through edge 8 is equal to $f_8 = (e_2 - e_3)/r_8$.

We can write the edge flow equations in a compact way as

$$Rf = -A^T e, \tag{8.7}$$

where R = diag(r) is called the resistance matrix.
The diffusion model can be expressed as one set of block linear equations in the
variables f, s, and e:

$$\begin{bmatrix} A & I & 0 \\ R & 0 & A^T \end{bmatrix} \begin{bmatrix} f \\ s \\ e \end{bmatrix} = 0.$$

This is a set of n + m homogeneous equations in m + 2n variables. To these
under-determined equations we can add others, for example, by specifying some of the
entries of f, s, and e.
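
To make the block equations concrete, here is a small sketch in Python with NumPy. The graph (3 nodes, 3 edges), the resistances, and the potentials are made-up values of our own. We specify the potentials e, recover the flows from (8.7) and the sources from (8.6), and then check that the stacked vector (f, s, e) satisfies the block equation.

```python
import numpy as np

# Hypothetical directed graph: edge 1 goes 1 -> 2, edge 2 goes 2 -> 3, edge 3 goes 1 -> 3.
# Incidence matrix (n x m): +1 where an edge enters a node, -1 where it leaves.
A = np.array([[-1.0,  0.0, -1.0],
              [ 1.0, -1.0,  0.0],
              [ 0.0,  1.0,  1.0]])
n, m = A.shape

r = np.array([1.0, 2.0, 1.0])   # made-up edge resistances
R = np.diag(r)                  # resistance matrix
e = np.array([3.0, 1.0, 0.0])   # made-up node potentials

f = np.linalg.solve(R, -A.T @ e)   # edge flows from R f = -A^T e
s = -A @ f                         # sources from A f + s = 0

# Block coefficient matrix of the diffusion model, of size (n + m) x (m + 2n).
M = np.block([[A, np.eye(n), np.zeros((n, n))],
              [R, np.zeros((m, n)), A.T]])
z = np.concatenate([f, s, e])
residual = M @ z

print(f)          # flow on each edge
print(s)          # exogenous flow at each node
print(residual)   # should be (numerically) zero
```

Fixing all of e is just one way of adding equations to the under-determined system; other choices, such as fixing some sources and some potentials, lead to different linear systems in the remaining unknowns.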


Leontief input-output model. We consider an economy with n different industrial
sectors. We let $x_i$ be the economic activity level, or total production output, of
sector i, for $i = 1, \ldots, n$, measured in a common unit, such as (billions of) dollars.
The output of each sector flows to other sectors, to support their production, and
also to consumers. We denote the total consumer demand for sector i as $d_i$, for
$i = 1, \ldots, n$.

Supporting the output level $x_j$ for sector j requires $A_{ij}x_j$ output for sector i.
We refer to $A_{ij}x_j$ as the sector i input that flows to sector j. (We can have $A_{ii} \neq 0$;
for example, it requires some energy to support the production of energy.) Thus,
$A_{i1}x_1 + \cdots + A_{in}x_n$ is the total sector i output required by, or flowing into, the n
industrial sectors. The matrix A is called the input-output matrix of the economy,
since it describes the flows of sector outputs to the inputs of itself and other sectors.
The vector Ax gives the sector outputs required to support the production levels
given by x. (This sounds circular, but isn’t.)

Finally, we require that for each sector, the total production level matches the
demand plus the total amount required to support production. This leads to the
balance equations,

$$x = Ax + d.$$

Suppose the demand vector d is given, and we wish to find the sector output levels
that will support it. We can write this as a set of n equations in n unknowns,

$$(I - A)x = d.$$
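
As a small numerical illustration, the balance equations can be solved directly. The sketch below uses Python with NumPy; the 3-sector input-output matrix and the demand vector are made-up numbers chosen only for illustration, not data from any real economy.

```python
import numpy as np

# Hypothetical input-output matrix: A[i, j] is the sector i output needed
# per unit of sector j output (made-up values).
A = np.array([[0.1, 0.2, 0.0],
              [0.3, 0.1, 0.2],
              [0.0, 0.1, 0.3]])
d = np.array([10.0, 20.0, 30.0])   # made-up consumer demand

n = A.shape[0]
x = np.linalg.solve(np.eye(n) - A, d)   # solve (I - A) x = d

print(x)                           # sector output levels that support demand d
print(np.allclose(x, A @ x + d))   # balance equations x = A x + d hold
```

The computed x exceeds d in every sector here, since each sector must also produce what the other sectors consume as inputs.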


This model of the sector inputs and outputs of an economy was developed by
Wassily Leontief in the late 1940s, and is now known as Leontief input-output
analysis. He was awarded the Nobel Prize in economics for this work in 1973.

Exercises


Sum of linear functions. Suppose $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^m$ are linear functions.
Their sum is the function $h : \mathbf{R}^n \to \mathbf{R}^m$, defined as h(x) = f(x) + g(x) for any n-vector x.
The sum function is often denoted as h = f + g. (This is another case of overloading the +
symbol, in this case to the sum of functions.) If f has matrix representation f(x) = Fx,
and g has matrix representation g(x) = Gx, where F and G are m × n matrices, what
is the matrix representation of the sum function h = f + g? Be sure to identify any +
symbols appearing in your justification.


Averages and affine functions. Suppose that $G : \mathbf{R}^n \to \mathbf{R}^m$ is an affine function. Let
$x_1, \ldots, x_k$ be n-vectors, and define the m-vectors $y_1 = G(x_1), \ldots, y_k = G(x_k)$. Let

$$\bar x = (x_1 + \cdots + x_k)/k, \qquad \bar y = (y_1 + \cdots + y_k)/k$$

be the averages of these two lists of vectors. (Here $\bar x$ is an n-vector and $\bar y$ is an m-vector.)
Show that we always have $\bar y = G(\bar x)$. In words: The average of an affine function applied
to a list of vectors is the same as the affine function applied to the average of the list of
vectors.


Cross-product. The cross product of two 3-vectors $a = (a_1, a_2, a_3)$ and $x = (x_1, x_2, x_3)$ is
defined as the vector

$$a \times x = \begin{bmatrix} a_2 x_3 - a_3 x_2 \\ a_3 x_1 - a_1 x_3 \\ a_1 x_2 - a_2 x_1 \end{bmatrix}.$$

The cross product comes up in physics, for example in electricity and magnetism, and in
dynamics of mechanical systems like robots or satellites. (You do not need to know this
for this exercise.)

Assume a is fixed. Show that the function $f(x) = a \times x$ is a linear function of x, by giving
a matrix A that satisfies f(x) = Ax for all x.


Linear functions of images. In this problem we consider several linear functions of a 
monochrome image with N x N pixels. To keep the matrices small enough to work out 
by hand, we will consider the case with N = 3 (which would hardly qualify as an image). 
We represent a 3 x 3 image as a 9-vector using the ordering of pixels shown below. 


1 4 7 
2 5 8 
3 6 9 
(This ordering is called column-major.) Each of the operations or transformations below
defines a function y = f(x), where the 9-vector x represents the original image, and the
9-vector y represents the resulting or transformed image. For each of these operations,
give the 9 × 9 matrix A for which y = Ax.

(a) Turn the original image x upside-down.

(b) Rotate the original image x clockwise 90°.

(c) Translate the image up by 1 pixel and to the right by 1 pixel. In the translated
image, assign the value $y_i = 0$ to the pixels in the first column and the last row.

(d) Set each pixel value $y_i$ to be the average of the neighbors of pixel i in the original
image. By neighbors, we mean the pixels immediately above and below, and immediately
to the left and right. The center pixel has 4 neighbors; corner pixels have 2
neighbors, and the remaining pixels have 3 neighbors.


Symmetric and anti-symmetric part. An n-vector x is symmetric if $x_k = x_{n-k+1}$ for
$k = 1, \ldots, n$. It is anti-symmetric if $x_k = -x_{n-k+1}$ for $k = 1, \ldots, n$.

(a) Show that every vector x can be decomposed in a unique way as a sum $x = x_s + x_a$
of a symmetric vector $x_s$ and an anti-symmetric vector $x_a$.

(b) Show that the symmetric and anti-symmetric parts $x_s$ and $x_a$ are linear functions
of x. Give matrices $A_s$ and $A_a$ such that $x_s = A_s x$ and $x_a = A_a x$ for all x.


Linear functions. For each description of y below, express it as y = Ax for some A. (You
should specify A.)

(a) $y_i$ is the difference between $x_i$ and the average of $x_1, \ldots, x_{i-1}$. (We take $y_1 = x_1$.)

(b) $y_i$ is the difference between $x_i$ and the average value of all other $x_j$s, i.e., the average
of $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$.


Interpolation of polynomial values and derivatives. The 5-vector c represents the coefficients
of a quartic polynomial $p(x) = c_1 + c_2 x + c_3 x^2 + c_4 x^3 + c_5 x^4$. Express the conditions

$$p(0) = 0, \qquad p'(0) = 0, \qquad p(1) = 1, \qquad p'(1) = 0,$$

as a set of linear equations of the form Ac = b. Is the system of equations
under-determined, over-determined, or square?


Interpolation of rational functions. A rational function of degree two has the form

$$f(t) = \frac{c_1 + c_2 t + c_3 t^2}{1 + d_1 t + d_2 t^2},$$

where $c_1, c_2, c_3, d_1, d_2$ are coefficients. (‘Rational’ refers to the fact that f is a ratio of
polynomials. Another name for f is bi-quadratic.) Consider the interpolation conditions

$$f(t_i) = y_i, \qquad i = 1, \ldots, K,$$

where $t_i$ and $y_i$ are given numbers. Express the interpolation conditions as a set of linear
equations in the vector of coefficients $\theta = (c_1, c_2, c_3, d_1, d_2)$, as $A\theta = b$. Give A and b, and
their dimensions.


Required nutrients. We consider a set of n basic foods (such as rice, beans, apples) and
a set of m nutrients or components (such as protein, fat, sugar, vitamin C). Food j has
a cost given by $c_j$ (say, in dollars per gram), and contains an amount $N_{ij}$ of nutrient i
(per gram). (The nutrients are given in some appropriate units, which can depend on
the particular nutrient.) A daily diet is represented by an n-vector d, with $d_i$ the daily
intake (in grams) of food i. Express the condition that a diet d contains the total nutrient
amounts given by the m-vector $n^{\mathrm{des}}$, and has a total cost B (the budget) as a set of
linear equations in the variables $d_1, \ldots, d_n$. (The entries of d must be nonnegative, but
we ignore this issue here.)


Blending crude oil. A set of K different types of crude oil are blended (mixed) together in
proportions $\theta_1, \ldots, \theta_K$. These numbers sum to one; they must also be nonnegative, but we
will ignore that requirement here. Associated with crude oil type k is an n-vector $c_k$ that
gives its concentration of n different constituents, such as specific hydrocarbons. Find a
set of linear equations on the blending coefficients, $A\theta = b$, that expresses the requirement
that the blended crude oil achieves a target set of constituent concentrations, given by
the n-vector $c^{\mathrm{tar}}$. (Include the condition that $\theta_1 + \cdots + \theta_K = 1$ in your equations.)


Location from range measurements. The 3-vector x represents a location in 3-D. We
measure the distances (also called the range) of x to four points at known locations $a_1$,
$a_2$, $a_3$, $a_4$:

$$\rho_1 = \|x - a_1\|, \quad \rho_2 = \|x - a_2\|, \quad \rho_3 = \|x - a_3\|, \quad \rho_4 = \|x - a_4\|.$$

Express these distance conditions as a set of three linear equations in the vector x. Hint.
Square the distance equations, and subtract one from the others.


Quadrature. Consider a function $f : \mathbf{R} \to \mathbf{R}$. We are interested in estimating the definite
integral $\alpha = \int_{-1}^{1} f(x)\,dx$ based on the value of f at some points $t_1, \ldots, t_n$. (We typically
have $-1 \le t_1 < t_2 < \cdots < t_n \le 1$, but this is not needed here.) The standard method for
estimating $\alpha$ is to form a weighted sum of the values $f(t_i)$:

$$\hat\alpha = w_1 f(t_1) + \cdots + w_n f(t_n),$$

where $\hat\alpha$ is our estimate of $\alpha$, and $w_1, \ldots, w_n$ are the weights. This method of estimating
the value of an integral of a function from its values at some points is a classical method in
applied mathematics called quadrature. There are many quadrature methods (i.e., choices
of the points $t_i$ and weights $w_i$). The most famous one is due to the mathematician Carl
Friedrich Gauss, and bears his name.

(a) A typical requirement in quadrature is that the approximation should be exact (i.e.,
$\hat\alpha = \alpha$) when f is any polynomial up to degree d, where d is given. In this case
we say that the quadrature method has order d. Express this condition as a set of
linear equations on the weights, Aw = b, assuming the points $t_1, \ldots, t_n$ are given.
Hint. If $\hat\alpha = \alpha$ holds for the specific cases $f(x) = 1$, $f(x) = x$, ..., $f(x) = x^d$, then
it holds for any polynomial of degree up to d.


(b) Show that the following quadrature methods have order 1, 2, and 3 respectively.

• Trapezoid rule: n = 2, $t_1 = -1$, $t_2 = 1$, and

$$w_1 = 1, \quad w_2 = 1.$$

• Simpson’s rule: n = 3, $t_1 = -1$, $t_2 = 0$, $t_3 = 1$, and

$$w_1 = 1/3, \quad w_2 = 4/3, \quad w_3 = 1/3.$$

(Named after the mathematician Thomas Simpson.)

• Simpson’s 3/8 rule: n = 4, $t_1 = -1$, $t_2 = -1/3$, $t_3 = 1/3$, $t_4 = 1$,

$$w_1 = 1/4, \quad w_2 = 3/4, \quad w_3 = 3/4, \quad w_4 = 1/4.$$


Portfolio sector exposures. (See exercise 1.14.) The n-vector h denotes a portfolio of
investments in n assets, with $h_i$ the dollar value invested in asset i. We consider a set
of m industry sectors, such as pharmaceuticals or consumer electronics. Each asset is
assigned to one of these sectors. (More complex models allow for an asset to be assigned
to more than one sector.) The exposure of the portfolio to sector i is defined as the sum
of investments in the assets in that sector. We denote the sector exposures using the
m-vector s, where $s_i$ is the portfolio exposure to sector i. (When $s_i = 0$, the portfolio
is said to be neutral to sector i.) An investment advisor specifies a set of desired sector
exposures, given as the m-vector $s^{\mathrm{des}}$. Express the requirement $s = s^{\mathrm{des}}$ as a set of linear
equations of the form Ah = b. (You must describe the matrix A and the vector b.)
Remark. A typical practical case involves n = 1000 assets and m = 50 sectors. An advisor
might specify $s_i^{\mathrm{des}} = 0$ if she does not have an opinion as to how companies in that sector
will do in the future; she might specify a positive value for $s_i^{\mathrm{des}}$ if she thinks the companies
in that sector will do well (i.e., generate positive returns) in the future, and a negative
value if she thinks they will do poorly.


Affine combinations of solutions of linear equations. Consider the set of m linear equations
in n variables Ax = b, where A is an m × n matrix, b is an m-vector, and x is the n-vector
of variables. Suppose that the n-vectors $z_1, \ldots, z_k$ are solutions of this set of equations,
i.e., satisfy $Az_i = b$. Show that if the coefficients $\alpha_1, \ldots, \alpha_k$ satisfy $\alpha_1 + \cdots + \alpha_k = 1$,
then the affine combination

$$w = \alpha_1 z_1 + \cdots + \alpha_k z_k$$

is a solution of the linear equations, i.e., satisfies Aw = b. In words: Any affine combination
of solutions of a set of linear equations is also a solution of the equations.




Stoichiometry and equilibrium reaction rates. We consider a system (such as a single cell)
containing m metabolites (chemical species), with n reactions among the metabolites
occurring at rates given by the n-vector r. (A negative reaction rate means the reaction
runs in reverse.) Each reaction consumes some metabolites and produces others, in known
rates proportional to the reaction rate. This is specified in the m × n stoichiometry matrix
S, where $S_{ij}$ is the rate of metabolite i production by reaction j, running at rate one.
(When $S_{ij}$ is negative, it means that when reaction j runs at rate one, metabolite i is
consumed.) The system is said to be in equilibrium if the total production rate of each
metabolite, due to all the reactions, is zero. This means that for each metabolite, the
total production rate balances the total consumption rate, so the total quantities of the
metabolites in the system do not change. Express the condition that the system is in
equilibrium as a set of linear equations in the reaction rates.


Bi-linear interpolation. We are given a scalar value at each of the four corners of a square
in 2-D, $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, and $(x_2, y_2)$, where $x_1 < x_2$ and $y_1 < y_2$. We refer to
these four values as $F_{11}$, $F_{12}$, $F_{21}$, and $F_{22}$, respectively. A bi-linear interpolation is a
function of the form

$$f(u, v) = \theta_1 + \theta_2 u + \theta_3 v + \theta_4 uv,$$

where $\theta_1, \ldots, \theta_4$ are coefficients, that satisfies

$$f(x_1, y_1) = F_{11}, \quad f(x_1, y_2) = F_{12}, \quad f(x_2, y_1) = F_{21}, \quad f(x_2, y_2) = F_{22},$$

i.e., it agrees with (or interpolates) the given values on the four corners of the square.
(The function f is usually evaluated only for points (u, v) inside the square. It is called
bi-linear since it is affine in u when v is fixed, and affine in v when u is fixed.)

Express the interpolation conditions as a set of linear equations of the form $A\theta = b$, where
A is a 4 × 4 matrix and b is a 4-vector. Give the entries of A and b in terms of $x_1$, $x_2$, $y_1$,
$y_2$, $F_{11}$, $F_{12}$, $F_{21}$, and $F_{22}$.

Remark. Bi-linear interpolation is used in many applications to guess or approximate the
values of a function at an arbitrary point in 2-D, given the function values on a grid of
points. To approximate the value at a point (x, y), we first find the square of grid points
that the point lies in. Then we use bi-linear interpolation to get the approximate value
at (x, y).


Chapter 9


Linear dynamical systems 


In this chapter we consider a useful application of matrix-vector multiplication, 
which is used to describe many systems or phenomena that change or evolve over 
time. 


Linear dynamical systems 


Suppose $x_1, x_2, \ldots$ is a sequence of n-vectors. The index (subscript) denotes time
or period, and is written as t; $x_t$, the value of the sequence at time (or period)
t, is called the state at time t. We can think of $x_t$ as a vector that changes over
time, i.e., one that changes dynamically. In this context, the sequence $x_1, x_2, \ldots$ is
sometimes called a trajectory or state trajectory. We sometimes refer to $x_t$ as the
current state of the system (implicitly assuming the current time is t), and $x_{t+1}$ as
the next state, $x_{t-1}$ as the previous state, and so on.

The state $x_t$ can represent a portfolio that changes daily, or the positions and
velocities of the parts of a mechanical system, or the quarterly activity of an economy.
If $x_t$ represents a portfolio that changes daily, $(x_5)_3$ is the amount of asset 3
held in the portfolio on (trading) day 5.

A linear dynamical system is a simple model for the sequence, in which each
$x_{t+1}$ is a linear function of $x_t$:

$$x_{t+1} = A_t x_t, \qquad t = 1, 2, \ldots. \tag{9.1}$$

Here the n × n matrices $A_t$ are called the dynamics matrices. The equation above
is called the dynamics or update equation, since it gives us the next value of x, i.e.,
$x_{t+1}$, as a function of the current value $x_t$. Often the dynamics matrix does not
depend on t, in which case the linear dynamical system is called time-invariant.
If we know $x_t$ (and $A_t, A_{t+1}, \ldots$) we can determine $x_{t+1}, x_{t+2}, \ldots$ simply by
iterating the dynamics equation (9.1). In other words: If we know the current
value of x, we can find all future values. In particular, we do not need to know
the past states. This is why $x_t$ is called the state of the system. It contains all the
information needed at time t to determine the future evolution of the system.

Linear dynamical system with input. There are many variations on and extensions
of the basic linear dynamical system model (9.1), some of which we will
encounter later. As an example, we can add additional terms to the update equation:

$$x_{t+1} = A_t x_t + B_t u_t + c_t, \qquad t = 1, 2, \ldots. \tag{9.2}$$

Here $u_t$ is an m-vector called the input, $B_t$ is the n × m input matrix, and the
n-vector $c_t$ is called the offset, all at time t. The input and offset are used to model
other factors that affect the time evolution of the state. Another name for the
input $u_t$ is exogenous variable, since, roughly speaking, it comes from outside the
system.


Markov model. The linear dynamical system (9.1) is sometimes called a Markov
model (after the mathematician Andrey Markov). Markov studied systems in which
the next state value depends on the current one, and not on the previous state
values $x_{t-1}, x_{t-2}, \ldots$. The linear dynamical system (9.1) is the special case of a
Markov system where the next state is a linear function of the current state.

In a variation on the Markov model, called a (linear) K-Markov model, the
next state $x_{t+1}$ depends on the current state and $K - 1$ previous states. Such a
system has the form

$$x_{t+1} = A_1 x_t + \cdots + A_K x_{t-K+1}, \qquad t = K, K+1, \ldots. \tag{9.3}$$

Models of this form are used in time series analysis and econometrics, where they
are called (vector) auto-regressive models. When K = 1, the Markov model (9.3) is
the same as a linear dynamical system (9.1). When K > 1, the Markov model (9.3)
can be reduced to a standard linear dynamical system (9.1), with an appropriately
chosen state; see exercise 9.4.


Simulation. If we know the dynamics (and input) matrices, and the state at
time t, we can find the future state trajectory $x_{t+1}, x_{t+2}, \ldots$ by iterating the
equation (9.1) (or (9.2), provided we also know the input sequence $u_t, u_{t+1}, \ldots$). This is
called simulating the linear dynamical system. Simulation makes predictions about
the future state of a system. (To the extent that (9.1) is only an approximation or
model of some real system, we must be careful when interpreting the results.) We
can carry out what-if simulations, to see what would happen if the system changes
in some way, or if a particular set of inputs occurs.
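
Simulation amounts to a simple loop. The following Python (NumPy) sketch handles the time-invariant case of (9.2), with the input and offset optional; the function name `simulate`, its argument layout (inputs stored as columns of an array), and the small test systems are assumptions of ours for illustration.

```python
import numpy as np


def simulate(A, x1, T, B=None, u=None, c=None):
    """Simulate x_{t+1} = A x_t + B u_t + c for a time-invariant system.

    Returns an n x T array whose columns are x_1, ..., x_T.
    If given, u is an m x (T-1) array whose columns are u_1, ..., u_{T-1}.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    X = np.zeros((n, T))
    X[:, 0] = x1
    for t in range(T - 1):
        x_next = A @ X[:, t]
        if B is not None and u is not None:
            x_next = x_next + B @ u[:, t]   # contribution of the input
        if c is not None:
            x_next = x_next + c             # constant offset
        X[:, t + 1] = x_next
    return X


# Without input: each entry is scaled independently at every step.
X = simulate(np.diag([0.5, 2.0]), [1.0, 1.0], T=4)
print(X[:, -1])   # state x_4

# With input: a scalar accumulator driven by a constant unit input.
Xu = simulate(np.array([[1.0]]), [0.0], T=3,
              B=np.array([[1.0]]), u=np.ones((1, 2)))
print(Xu)         # states x_1, x_2, x_3
```

A what-if simulation in the sense described above is then just a second call to `simulate` with a modified A, a different input sequence, or a different initial state.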


Population dynamics 


Linear dynamical systems can be used to describe the evolution of the age distribution
in some population over time. Suppose $x_t$ is a 100-vector, with $(x_t)_i$
denoting the number of people in some population (say, a country) with age $i - 1$
(say, on January 1) in year t, where t is measured starting from some base year, for
$i = 1, \ldots, 100$. While $(x_t)_i$ is an integer, it is large enough that we simply consider
it a real number. In any case, our model certainly is not accurate at the level of
individual people. Also, note that the model does not track people 100 and older.
The distribution of ages in the US in 2010 is shown in figure 9.1.

Figure 9.1 Age distribution in the US in 2010. (United States Census Bureau,
census.gov).

The birth rate is given by a 100-vector b, where $b_i$ is the average number of
births per person with age $i - 1$, $i = 1, \ldots, 100$. (This is half the average number
of births per woman with age $i - 1$, assuming equal numbers of men and women
in the population.) Of course $b_i$ is approximately zero for i < 13 and i > 50. The
approximate birth rates for the US in 2010 are shown in figure 9.2. The death rate
is given by a 100-vector d, where $d_i$ is the portion of those aged $i - 1$ who will die
this year. The death rates for the US in 2010 are shown in figure 9.3.

To derive the dynamics equation (9.1), we find $x_{t+1}$ in terms of $x_t$, taking into
account only births and deaths, and not immigration. The number of 0-year-olds
next year is the total number of births this year:

$$(x_{t+1})_1 = b^T x_t.$$

The number of i-year-olds next year is the number of $(i - 1)$-year-olds this year,
minus those who die:

$$(x_{t+1})_{i+1} = (1 - d_i)(x_t)_i, \qquad i = 1, \ldots, 99.$$

We can assemble these equations into the time-invariant linear dynamical system

$$x_{t+1} = A x_t, \qquad t = 1, 2, \ldots, \tag{9.4}$$

where A is given by

$$
A = \begin{bmatrix}
b_1 & b_2 & b_3 & \cdots & b_{98} & b_{99} & b_{100} \\
1 - d_1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 - d_2 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 - d_{98} & 0 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1 - d_{99} & 0
\end{bmatrix}.
$$

Figure 9.2 Approximate birth rate versus age in the US in 2010. The figure is
based on statistics for age groups of five years (hence, the piecewise-constant
shape) and assumes an equal number of men and women in each age group.
(Martin J.A., Hamilton B.E., Ventura S.J. et al., Births: Final data for 2010.
National Vital Statistics Reports; vol. 61, no. 1. National Center for Health
Statistics, 2012.)

Figure 9.3 Death rate versus age, for ages 0-99, in the US in 2010. (Centers
for Disease Control and Prevention, National Center for Health Statistics,
wonder.cdc.gov.)

Figure 9.4 Predicted age distribution in the US in 2020.


We can use this model to predict the total population in 10 years (not including
immigration), or to predict the number of school age children, or retirement age
adults. Figure 9.4 shows the predicted age distribution in 2020, computed by
iterating the model $x_{t+1} = Ax_t$ for $t = 1, \ldots, 10$, with initial value $x_1$ given by
the 2010 age distribution of figure 9.1. Note that the distribution is based on an
approximate model, since we neglect the effect of immigration, and assume that the
death and birth rates remain constant and equal to the values shown in figures 9.2
and 9.3.
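
The construction of A and the ten-year projection can be sketched as follows in Python with NumPy. The birth rates, death rates, and initial distribution below are made-up placeholders, not the US data of figures 9.1 to 9.3, and the helper names `population_matrix` and `project` are our own.

```python
import numpy as np


def population_matrix(b, d):
    """Dynamics matrix of (9.4): birth rates in the first row,
    survival fractions 1 - d_i on the subdiagonal."""
    b = np.asarray(b, dtype=float)
    d = np.asarray(d, dtype=float)
    n = b.size
    A = np.zeros((n, n))
    A[0, :] = b                          # newborns next year
    idx = np.arange(n - 1)
    A[idx + 1, idx] = 1.0 - d[:n - 1]    # survivors move up one age
    return A


def project(A, x1, years):
    """Iterate x_{t+1} = A x_t for the given number of years."""
    x = np.asarray(x1, dtype=float)
    for _ in range(years):
        x = A @ x
    return x


# Made-up rates for illustration only (NOT the data shown in the figures).
ages = np.arange(100)
b = np.where((ages >= 20) & (ages < 40), 0.05, 0.0)
d = 0.001 + 0.2 * (ages / 99.0) ** 8
x1 = np.full(100, 1000.0)                # hypothetical initial age distribution

A = population_matrix(b, d)
x11 = project(A, x1, 10)                 # age distribution ten years later

print(x11.sum())                         # total population after ten years
print(x11[5:18].sum())                   # e.g., number of people aged 5 to 17
```

Note that the last death rate $d_{100}$ never enters A, consistent with the model not tracking people 100 and older.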

Population dynamics models are used to carry out projections of the future age 
distribution, which in turn is used to predict how many retirees there will be in 
some future year. They are also used to carry out various ‘what if’ analyses, to 
predict the effect of changes in birth or death rates on the future age distribution. 

It is easy to include the effects of immigration and emigration in the population
dynamics model (9.4), by simply adding a 100-vector $u_t$:

$$x_{t+1} = Ax_t + u_t,$$

which is a time-invariant linear dynamical system of the form (9.2), with input $u_t$
and B = I. The vector $u_t$ gives the net immigration in year t over all ages; $(u_t)_i$
is the number of immigrants in year t of age $i - 1$. (Negative entries mean net
emigration.)


Epidemic dynamics


The dynamics of infection and the spread of an epidemic can be modeled using a 
linear dynamical system. (More sophisticated nonlinear epidemic dynamic models 
are also used.) In this section we describe a simple example. 

A disease is introduced into a population. In each period (say, days) we count 
the fraction of the population that is in four different infection states: 


• Susceptible. These individuals can acquire the disease the next day.

• Infected. These individuals have the disease.

• Recovered (and immune). These individuals had the disease and survived,
and now have immunity.

• Deceased. These individuals had the disease, and unfortunately died from it.


We denote the fractions of each of these as a 4-vector $x_t$, so, for example,
$x_t = (0.75, 0.10, 0.10, 0.05)$ means that on day $t$, 75% of the population is susceptible,
10% is infected, 10% is recovered and immune, and 5% has died from the disease.

There are many mathematical models that predict how the disease state fractions
$x_t$ evolve over time. One simple model can be expressed as a linear dynamical
system. The model assumes the following happens over each day.


• 5% of the susceptible population will acquire the disease. (The other 95%
will remain susceptible.)

• 1% of the infected population will die from the disease, 10% will recover
and acquire immunity, and 4% will recover and not acquire immunity (and
therefore become susceptible). The remaining 85% will remain infected.

(Those who have recovered with immunity and those who have died remain
in those states.)

We first determine $(x_{t+1})_1$, the fraction of susceptible individuals in the next
day. These include the susceptible individuals from today who did not become
infected, which is $0.95(x_t)_1$, plus the infected individuals today who recovered
without immunity, which is $0.04(x_t)_2$. All together we have
$(x_{t+1})_1 = 0.95(x_t)_1 + 0.04(x_t)_2$. We have $(x_{t+1})_2 = 0.85(x_t)_2 + 0.05(x_t)_1$;
the first term counts those who are infected and remain infected, and the second
term counts those who are susceptible and acquire the disease. Similar arguments
give $(x_{t+1})_3 = (x_t)_3 + 0.10(x_t)_2$ and $(x_{t+1})_4 = (x_t)_4 + 0.01(x_t)_2$. We put
these together to get

$$
x_{t+1} = \begin{bmatrix}
0.95 & 0.04 & 0 & 0 \\
0.05 & 0.85 & 0 & 0 \\
0 & 0.10 & 1 & 0 \\
0 & 0.01 & 0 & 1
\end{bmatrix} x_t,
$$

which is a time-invariant linear dynamical system of the form (9.1).

Figure 9.5 shows the evolution of the four groups from the initial condition
$x_1 = (1, 0, 0, 0)$. The simulation shows that after around 100 days, the state converges
to one with a little under 10% of the population deceased, and the remaining 
population immune. 
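The simulation can be reproduced with a short NumPy sketch. The matrix entries are the transition fractions listed above; the function name `simulate_epidemic` and the horizon of 210 days are assumptions made for illustration.

```python
import numpy as np

# Columns correspond to today's state, rows to tomorrow's state, in the order
# susceptible, infected, recovered, deceased.
A = np.array([
    [0.95, 0.04, 0.0, 0.0],
    [0.05, 0.85, 0.0, 0.0],
    [0.00, 0.10, 1.0, 0.0],
    [0.00, 0.01, 0.0, 1.0],
])

def simulate_epidemic(x1, T):
    """Return a T x 4 array whose row t-1 holds the state x_t, for t = 1, ..., T."""
    X = np.zeros((T, 4))
    X[0] = x1
    for t in range(T - 1):
        X[t + 1] = A @ X[t]
    return X

X = simulate_epidemic(np.array([1.0, 0.0, 0.0, 0.0]), T=210)
print(X[-1])    # fractions susceptible, infected, recovered, deceased on the last day
```

Each column of the matrix sums to one, so the four fractions always sum to one. Because the recovered and deceased states are fed only by the infected state, in the fixed proportion 0.10 to 0.01, the deceased fraction in this model approaches 1/11, about 9.1%, which agrees with the "little under 10%" seen in the figure.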

Figure 9.5 Simulation of epidemic dynamics.

Figure 9.6 Mass moving along a line.


Motion of a mass 


Linear dynamical systems can be used to (approximately) describe the motion of 
many mechanical systems, for example, an airplane (that is not undergoing extreme 
maneuvers), or the (hopefully not too large) movement of a building during an 
earthquake. Here we describe the simplest example: A single mass moving in 1-D 
(i.e., a straight line), with an external force and a drag force acting on it. This 
is illustrated in figure 9.6. The (scalar) position of the mass at time $\tau$ is given by
$p(\tau)$. (Here $\tau$ is continuous, i.e., a real number.) The position satisfies Newton's
law of motion, the differential equation

$$
m \frac{d^2 p}{d\tau^2}(\tau) = -\eta \frac{dp}{d\tau}(\tau) + f(\tau),
$$

where $m > 0$ is the mass, $f(\tau)$ is the external force acting on the mass at time $\tau$,
and $\eta > 0$ is the drag coefficient. The right-hand side is the total force acting on
the mass; the first term is the drag force, which is proportional to the velocity and
in the opposite direction.

Introducing the velocity of the mass, $v(\tau) = dp(\tau)/d\tau$, we can write the equation
above as two coupled differential equations,

$$
\frac{dp}{d\tau}(\tau) = v(\tau), \qquad m \frac{dv}{d\tau}(\tau) = -\eta v(\tau) + f(\tau).
$$


The first equation relates the position and velocity; the second is from the law of 
motion. 


Discretization. To develop an (approximate) linear dynamical system model from 
the differential equations above, we first discretize time. We let h > 0 be a time 
interval (called the ‘sampling interval’) that is small enough that the velocity and 
forces do not change very much over h seconds. We define 


$$
p_k = p(kh), \qquad v_k = v(kh), \qquad f_k = f(kh),
$$


which are the continuous quantities ‘sampled’ at multiples of h seconds. We now 
use the approximations 


$$
\frac{dp}{d\tau}(kh) \approx \frac{p_{k+1} - p_k}{h}, \qquad
\frac{dv}{d\tau}(kh) \approx \frac{v_{k+1} - v_k}{h}, \tag{9.5}
$$


which are justified since h is small. This leads to the (approximate) equations 
(replacing $\approx$ with $=$)

$$
\frac{p_{k+1} - p_k}{h} = v_k, \qquad m \frac{v_{k+1} - v_k}{h} = f_k - \eta v_k.
$$


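Finally, using state $x_k = (p_k, v_k)$, we write this as

$$
x_{k+1} = \begin{bmatrix} 1 & h \\ 0 & 1 - h\eta/m \end{bmatrix} x_k
+ \begin{bmatrix} 0 \\ h/m \end{bmatrix} f_k, \quad k = 1, 2, \ldots,
$$

which is a linear dynamical system of the form (9.2), with input $f_k$ and dynamics
and input matrices

$$
A = \begin{bmatrix} 1 & h \\ 0 & 1 - h\eta/m \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ h/m \end{bmatrix}.
$$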


This linear dynamical system gives an approximation of the true motion, due to 
our approximation (9.5) of the derivatives. But for h small enough, it is accurate. 
This linear dynamical system can be used to simulate the motion of the mass, if 
we know the external force applied to it, i.e., the input $f_k$ for $k = 1, 2, \ldots$.

The approximation (9.5), which turns a set of differential equations into a
recursion that approximates it, is called the Euler method, named after the
mathematician Leonhard Euler. (There are other, more sophisticated, methods for
approximating differential equations as recursions.)

Example. As a simple example, we consider the case with $m = 1$ (kilogram),
$\eta = 1$ (Newtons per meter per second), and sampling period $h = 0.01$ (seconds).
The external force is

$$
f(\tau) = \begin{cases}
0.0 & 0.0 \le \tau < 0.5 \\
1.0 & 0.5 \le \tau < 1.0 \\
-1.3 & 1.0 \le \tau < 1.4 \\
0.0 & 1.4 \le \tau.
\end{cases}
$$

We simulate this system for a period of 2.5 seconds, starting from initial state
$x_1 = (0, 0)$, which corresponds to the mass starting at rest (zero velocity) at position 0.
The simulation involves iterating the dynamics equation from $k = 1$ to $k = 250$.
Figure 9.7 shows the force, position, and velocity of the mass, with the axes labeled
using continuous time $\tau$.
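The example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the names `force_sample` and `simulate_mass` are hypothetical, the force is taken to be 0 up to 0.5 s, 1.0 up to 1.0 s, -1.3 up to 1.4 s, and 0 afterwards, and the breakpoints are expressed as sample indices (valid for h = 0.01) to avoid floating-point rounding at the interval boundaries.

```python
import numpy as np

m, eta, h = 1.0, 1.0, 0.01                 # mass, drag coefficient, sampling interval
A = np.array([[1.0, h],
              [0.0, 1.0 - h * eta / m]])   # dynamics matrix from the Euler discretization
B = np.array([0.0, h / m])                 # input matrix (a single column)

def force_sample(k):
    """Sampled force f_k = f(kh), written in terms of the index k (assumes h = 0.01)."""
    if 50 <= k < 100:
        return 1.0
    if 100 <= k < 140:
        return -1.3
    return 0.0

def simulate_mass(K=250):
    """Return a (K+1) x 2 array; row k-1 holds x_k = (p_k, v_k) for k = 1, ..., K+1."""
    X = np.zeros((K + 1, 2))               # x_1 = (0, 0): at rest at position 0
    for k in range(1, K + 1):
        X[k] = A @ X[k - 1] + B * force_sample(k)
    return X

X = simulate_mass()
print("final position and velocity:", X[-1])
```

Plotting the two columns of `X` against the sample times gives position and velocity curves of the kind shown in figure 9.7. Once the force returns to zero, the velocity shrinks by the factor $1 - h\eta/m = 0.99$ at every step.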


Supply chain dynamics 


The dynamics of a supply chain can often be modeled using a linear dynamical 
system. (This simple model does not include some important aspects of a real 
supply chain, for example limits on storage at the warehouses, or the fact that 
demand fluctuates.) We give a simple example here. 

We consider a supply chain for a single divisible commodity (say, oil or gravel, or 
discrete quantities so small that their quantities can be considered real numbers). 
The commodity is stored at n warehouses or storage locations. Each of these 
locations has a target (desired) level or amount of the commodity, and we let the 
$n$-vector $x_t$ denote the deviations of the levels of the commodities from their target
levels. For example, $(x_5)_3$ is the actual commodity level at location 3, in period 5,
minus the target level for location 3. If this is positive it means we have more than 
the target level at the location; if it is negative, we have less than the target level 
at the location. 

The commodity is moved or transported in each period over a set of $m$
transportation links between the storage locations, and also enters and exits the nodes
through purchases (from suppliers) and sales (to end-users). The purchases and
sales are given by the $n$-vectors $p_t$ and $s_t$, respectively. We expect these to be
positive; but they can be negative if we include returns. The net effect of the purchases
and sales is that we add $(p_t - s_t)_i$ of the commodity at location $i$. (This number is
negative if we sell more than we purchase at the location.)

We describe the links by the $n \times m$ incidence matrix $A^{\mathrm{sc}}$ (see §7.3). The direction
of each link does not indicate the direction of commodity flow; it only sets the
reference direction for the flow: Commodity flow in the direction of the link is
considered positive and commodity flow in the opposite direction is considered
negative. We describe the commodity flow in period $t$ by the $m$-vector $f_t$. For
example, $(f_6)_2 = -1.4$ means that in time period 6, 1.4 units of the commodity
are moved along link 2 in the direction opposite the link direction (since the flow
is negative). The $n$-vector $A^{\mathrm{sc}} f_t$ gives the net flow of the commodity into the $n$
locations, due to the transport across the links.

Figure 9.7 Simulation of mass moving along a line. Applied force (top),
position (middle), and velocity (bottom).

Figure 9.8 A simple supply chain with n = 3 storage locations and m = 3
transportation links.


Taking into account the movement of the commodity across the network, and 
the purchase and sale of the commodity, we get the dynamics 


$$
x_{t+1} = x_t + A^{\mathrm{sc}} f_t + p_t - s_t, \quad t = 1, 2, \ldots.
$$


In applications where we control or run a supply chain, $s_t$ is beyond our control,
but we can manipulate $f_t$ (the flow of goods between storage locations) and $p_t$
(purchases at locations). This suggests treating $s_t$ as the offset, and $u_t = (f_t, p_t)$
as the input in a linear dynamical system with input (9.2). We can write the
dynamics equations above in this form, with dynamics and input matrices

$$
A = I, \qquad B = \begin{bmatrix} A^{\mathrm{sc}} & I \end{bmatrix}.
$$

(Note that $A^{\mathrm{sc}}$ refers to the supply chain graph incidence matrix, while $A$ is the
dynamics matrix in (9.2).) This gives

$$
x_{t+1} = Ax_t + B(f_t, p_t) - s_t, \quad t = 1, 2, \ldots.
$$


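A simple example is shown in figure 9.8. The supply chain dynamics equation is

$$
x_{t+1} = x_t + \begin{bmatrix}
-1 & -1 & 0 & 1 & 0 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} f_t \\ p_t \end{bmatrix} - s_t, \quad t = 1, 2, \ldots.
$$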


It is a good exercise to check that the matrix-vector product (the middle term of 
the right-hand side) gives the amount of commodity added at each location, as a 
result of shipment and purchasing. 
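The check suggested above can be done numerically. The sketch below assumes the three links run from location 1 to 2, from 1 to 3, and from 2 to 3 (the directions implied by the first three columns of the example matrix); the function name `supply_chain_step` and the sample flow, purchase, and sales values are hypothetical.

```python
import numpy as np

# Incidence matrix of the example graph: column j has -1 at the location
# link j leaves and +1 at the location it enters.
A_sc = np.array([[-1.0, -1.0,  0.0],
                 [ 1.0,  0.0, -1.0],
                 [ 0.0,  1.0,  1.0]])
n, m = A_sc.shape
B = np.hstack([A_sc, np.eye(n)])         # input matrix [A_sc  I]

def supply_chain_step(x, f, p, s):
    """One period of the dynamics x_{t+1} = x_t + B (f_t, p_t) - s_t."""
    u = np.concatenate([f, p])           # stacked input u_t = (f_t, p_t)
    return x + B @ u - s

x = np.zeros(3)
f = np.array([1.0, 0.0, 0.0])            # ship one unit along link 1
p = np.array([2.0, 0.0, 0.0])            # purchase two units at location 1
s = np.array([0.0, 0.0, 0.5])            # sell half a unit at location 3
x_next = supply_chain_step(x, f, p, s)
print(x_next)
```

Each column of the incidence matrix sums to zero, which reflects that shipping along a link moves commodity between locations without creating or destroying any.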

Exercises

9.1 Compartmental system. A compartmental system is a model used to describe the
movement of some material over time among a set of $n$ compartments of a system, and the
outside world. It is widely used in pharmaco-kinetics, the study of how the concentration
of a drug varies over time in the body. In this application, the material is a drug, and
the compartments are the bloodstream, lungs, heart, liver, kidneys, and so on.
Compartmental systems are special cases of linear dynamical systems.

In this problem we will consider a very simple compartmental system with 3
compartments. We let $(x_t)_i$ denote the amount of the material (say, a drug) in compartment $i$ at
time period $t$. Between period $t$ and period $t + 1$, the material moves as follows.

• 10% of the material in compartment 1 moves to compartment 2. (This decreases
the amount in compartment 1 and increases the amount in compartment 2.)

• 5% of the material in compartment 2 moves to compartment 3.

• 5% of the material in compartment 3 moves to compartment 1.

• 5% of the material in compartment 3 is eliminated.

Express this compartmental system as a linear dynamical system, $x_{t+1} = Ax_t$. (Give the
matrix $A$.) Be sure to account for all the material entering and leaving each compartment.

9.2 Dynamics of an economy. An economy (of a country or region) is described by an
$n$-vector $a_t$, where $(a_t)_i$ is the economic output in sector $i$ in year $t$ (measured in billions of
dollars, say). The total output of the economy in year $t$ is $\mathbf{1}^T a_t$. A very simple model of
how the economic output changes over time is $a_{t+1} = Ba_t$, where $B$ is an $n \times n$ matrix.
(This is closely related to the Leontief input-output model described on page 157 of the
book. But the Leontief model is static, i.e., doesn't consider how an economy changes
over time.) The entries of $a_t$ and $B$ are positive in general.

In this problem we will consider the specific model with $n = 4$ sectors and

$$
B = \begin{bmatrix}
0.10 & 0.06 & 0.05 & 0.70 \\
0.48 & 0.44 & 0.10 & 0.04 \\
0.00 & 0.55 & 0.52 & 0.04 \\
0.04 & 0.01 & 0.42 & 0.51
\end{bmatrix}.
$$

(a) Briefly interpret $B_{23}$, in English.

(b) Simulation. Suppose $a_1 = (0.6, 0.9, 1.3, 0.5)$. Plot the four sector outputs (i.e., $(a_t)_i$
for $i = 1, \ldots, 4$) and the total economic output (i.e., $\mathbf{1}^T a_t$) versus $t$, for $t = 1, \ldots, 20$.

9.3 Equilibrium point for linear dynamical system. Consider a time-invariant linear dynamical
system with offset, $x_{t+1} = Ax_t + c$, where $x_t$ is the state $n$-vector. We say that a vector $z$ is
an equilibrium point of the linear dynamical system if $x_1 = z$ implies $x_2 = z$, $x_3 = z$, ....
(In words: If the system starts in state $z$, it stays in state $z$.)

Find a matrix $F$ and vector $g$ for which the set of linear equations $Fz = g$ characterizes
equilibrium points. (This means: If $z$ is an equilibrium point, then $Fz = g$; conversely if
$Fz = g$, then $z$ is an equilibrium point.) Express $F$ and $g$ in terms of $A$, $c$, any standard
matrices or vectors (e.g., $I$, $\mathbf{1}$, or $0$), and matrix and vector operations.

Remark. Equilibrium points often have interesting interpretations. For example, if the
linear dynamical system describes the population dynamics of a country, with the vector $c$
denoting immigration (emigration when entries of $c$ are negative), an equilibrium point is
a population distribution that does not change, year to year. In other words, immigration
exactly cancels the changes in population distribution caused by aging, births, and deaths.


9.4 Reducing a Markov model to a linear dynamical system. Consider the 2-Markov model

$$
x_{t+1} = A_1 x_t + A_2 x_{t-1}, \quad t = 2, 3, \ldots,
$$

where $x_t$ is an $n$-vector. Define $z_t = (x_t, x_{t-1})$. Show that $z_t$ satisfies the linear dynamical
system equation $z_{t+1} = Bz_t$, for $t = 2, 3, \ldots$, where $B$ is a $(2n) \times (2n)$ matrix. This idea
can be used to express any $K$-Markov model as a linear dynamical system, with state
$(x_t, \ldots, x_{t-K+1})$.

9.5 Fibonacci sequence. The Fibonacci sequence $y_0, y_1, y_2, \ldots$ starts with $y_0 = 0$, $y_1 = 1$, and
for $t = 2, 3, \ldots$, $y_t$ is the sum of the previous two entries, i.e., $y_{t-1} + y_{t-2}$. (Fibonacci
is the name used by the 13th century mathematician Leonardo of Pisa.) Express this
as a time-invariant linear dynamical system with state $x_t = (y_t, y_{t-1})$ and output $y_t$,
for $t = 1, 2, \ldots$. Use your linear dynamical system to simulate (compute) the Fibonacci
sequence up to $t = 20$. Also simulate a modified Fibonacci sequence $z_0, z_1, z_2, \ldots$, which
starts with the same values $z_0 = 0$ and $z_1 = 1$, but for $t = 2, 3, \ldots$, $z_t$ is the difference of
the two previous values, i.e., $z_{t-1} - z_{t-2}$.

9.6 Recursive averaging. Suppose that $u_1, u_2, \ldots$ is a sequence of $n$-vectors. Let $x_1 = 0$, and
for $t = 2, 3, \ldots$, let $x_t$ be the average of $u_1, \ldots, u_{t-1}$, i.e., $x_t = (u_1 + \cdots + u_{t-1})/(t - 1)$.
Express this as a linear dynamical system with input, i.e., $x_{t+1} = A_t x_t + B_t u_t$, $t = 1, 2, \ldots$
(with initial state $x_1 = 0$). Remark. This can be used to compute the average of an
extremely large collection of vectors, by accessing them one-by-one.

9.7 Complexity of linear dynamical system simulation. Consider the time-invariant linear
dynamical system with $n$-vector state $x_t$ and $m$-vector input $u_t$, and dynamics
$x_{t+1} = Ax_t + Bu_t$, $t = 1, 2, \ldots$. You are given the matrices $A$ and $B$, the initial state $x_1$, and the
inputs $u_1, \ldots, u_{T-1}$. What is the complexity of carrying out a simulation, i.e., computing
$x_2, x_3, \ldots, x_T$? About how long would it take to carry out a simulation with $n = 15$,
$m = 5$, and $T = 10^5$, using a 1 Gflop/s computer?

Chapter 10


Matrix multiplication 


In this chapter we introduce matrix multiplication, a generalization of matrix-vector 
multiplication, and describe several interpretations and applications. 


Matrix-matrix multiplication 


It is possible to multiply two matrices using matrix multiplication. You can multiply 
two matrices A and B provided their dimensions are compatible, which means the 
number of columns of A equals the number of rows of B. Suppose A and B are 
compatible, e.g., A has size m x p and B has size p x n. Then the product matrix 
C = AB is the m x n matrix with elements 


$$
C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj} = A_{i1} B_{1j} + \cdots + A_{ip} B_{pj},
\quad i = 1, \ldots, m, \quad j = 1, \ldots, n. \tag{10.1}
$$


There are several ways to remember this rule. To find the $i, j$ element of the
product C = AB, you need to know the ith row of A and the jth column of B. 
The summation above can be interpreted as ‘moving left to right along the ith row 
of A’ while moving ‘top to bottom’ down the jth column of B. As you go, you 
keep a running sum of the product of elements, one from A and one from B. 

As a specific example, we have 


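$$
\begin{bmatrix} -1.5 & 3 & 2 \\ 1 & -1 & 0 \end{bmatrix}
\begin{bmatrix} -1 & -1 \\ 0 & -2 \\ 1 & 0 \end{bmatrix}
= \begin{bmatrix} 3.5 & -4.5 \\ -1 & 1 \end{bmatrix}.
$$

To find the 1,2 entry of the right-hand matrix, we move along the first row of
the left-hand matrix, and down the second column of the middle matrix, to get
$(-1.5)(-1) + (3)(-2) + (2)(0) = -4.5$.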
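The rule (10.1) translates directly into three nested loops. The sketch below is for illustration only: the function name `matmul_definition` is hypothetical, the matrices are a 2 × 3 and a 3 × 2 matrix chosen to match the worked example (first row $(-1.5, 3, 2)$, second column $(-1, -2, 0)$), and in practice one would use the built-in `@` operator, against which the result is compared.

```python
import numpy as np

def matmul_definition(A, B):
    """Compute C = AB entry by entry, following C_ij = sum_k A_ik B_kj."""
    m, p = A.shape
    p2, n = B.shape
    if p != p2:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, n))
    for i in range(m):              # row i of A
        for j in range(n):          # column j of B
            for k in range(p):      # running sum of products
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[-1.5, 3.0, 2.0],
              [ 1.0, -1.0, 0.0]])
B = np.array([[-1.0, -1.0],
              [ 0.0, -2.0],
              [ 1.0,  0.0]])
C = matmul_definition(A, B)
print(C)
print(np.allclose(C, A @ B))        # agrees with NumPy's built-in product
```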


Matrix-matrix multiplication includes as special cases several other types of 
multiplication (or product) we have encountered so far. 

Scalar-vector product. If $x$ is an $n$-vector and $a$ is a number, we can interpret the
scalar-vector product $xa$, with the scalar appearing on the right, as matrix-matrix
multiplication. We consider the $n$-vector $x$ to be an $n \times 1$ matrix, and the scalar
$a$ to be a $1 \times 1$ matrix. The matrix product $xa$ then makes sense, and is an $n \times 1$
matrix, which we consider the same as an $n$-vector. It coincides with the
scalar-vector product $xa$, which we usually write (by convention) as $ax$. But note that $ax$
cannot be interpreted as matrix-matrix multiplication (except when $n = 1$), since
the number of columns of $a$ (which is one) is not equal to the number of rows of $x$
(which is $n$).


Inner product. An important special case of matrix-matrix multiplication is the 
multiplication of a row vector with a column vector. If a and b are n-vectors, then 
the inner product 

$$
a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n
$$

can be interpreted as the matrix-matrix product of the $1 \times n$ matrix $a^T$ and the
$n \times 1$ matrix $b$. The result is a $1 \times 1$ matrix, which we consider to be a scalar. (This
explains the notation $a^T b$ for the inner product of vectors $a$ and $b$, defined in §1.4.)


Matrix-vector multiplication. The matrix-vector product y = Ax defined in (6.4) 
can be interpreted as a matrix-matrix product of $A$ with the $n \times 1$ matrix $x$.


Vector outer product. The outer product of an $m$-vector $a$ and an $n$-vector $b$ is
given by $ab^T$, which is an $m \times n$ matrix

$$
ab^T = \begin{bmatrix}
a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\
a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\
\vdots & \vdots & & \vdots \\
a_m b_1 & a_m b_2 & \cdots & a_m b_n
\end{bmatrix},
$$

whose entries are all products of the entries of $a$ and the entries of $b$. Note that
the outer product does not satisfy $ab^T = ba^T$, i.e., it is not symmetric (unlike the
inner product, which satisfies $a^T b = b^T a$). Indeed, the equation $ab^T = ba^T$ does not
even make sense, unless $m = n$; even then, it is not true in general.


Multiplication by identity. If $A$ is any $m \times n$ matrix, then $AI = A$ and $IA = A$,
i.e., when you multiply a matrix by an identity matrix, it has no effect. (Note the
different sizes of the identity matrices in the formulas $AI = A$ and $IA = A$.)


Matrix multiplication order matters. Matrix multiplication is (in general) not 
commutative: We do not (in general) have AB = BA. In fact, BA may not even 
make sense, or, if it makes sense, may be a different size than AB. For example, if 
A is 2 x 3 and B is 3 x 4, then AB makes sense (the dimensions are compatible) 
but BA does not even make sense (the dimensions are incompatible). Even when 
AB and BA both make sense and are the same size, i.e., when A and B are square, 
we do not (in general) have AB = BA. As a simple example, take the matrices 


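$$
A = \begin{bmatrix} 1 & 6 \\ 9 & 3 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 & -1 \\ -1 & 2 \end{bmatrix}.
$$

We have

$$
AB = \begin{bmatrix} -6 & 11 \\ -3 & -3 \end{bmatrix}, \qquad
BA = \begin{bmatrix} -9 & -3 \\ 17 & 0 \end{bmatrix}.
$$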

Two matrices A and B that satisfy AB = BA are said to commute. (Note that for 
AB = BA to make sense, A and B must both be square.) 
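A quick numerical check makes the point concrete. The sketch below takes the 2 × 2 pair $A = \begin{bmatrix} 1 & 6 \\ 9 & 3 \end{bmatrix}$, $B = \begin{bmatrix} 0 & -1 \\ -1 & 2 \end{bmatrix}$; the helper name `commute` is hypothetical.

```python
import numpy as np

A = np.array([[1, 6],
              [9, 3]])
B = np.array([[0, -1],
              [-1, 2]])

def commute(P, Q):
    """True if the square matrices P and Q satisfy PQ = QP (up to rounding)."""
    return np.allclose(P @ Q, Q @ P)

print(A @ B)                  # product in one order
print(B @ A)                  # product in the other order
print(commute(A, B))          # this pair does not commute
print(commute(A, A @ A))      # a square matrix commutes with its own powers
```

Some pairs do commute: any square matrix commutes with the identity and with its own powers, which is what the last line illustrates.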


Properties of matrix multiplication. The following properties hold and are easy
to verify from the definition of matrix multiplication. We assume that $A$, $B$, and
$C$ are matrices for which all the operations below are valid, and that $\gamma$ is a scalar.

• Associativity: $(AB)C = A(BC)$. Therefore we can write the product simply
as $ABC$.

• Associativity with scalar multiplication: $\gamma(AB) = (\gamma A)B$, where $\gamma$ is a scalar,
and $A$ and $B$ are matrices (that can be multiplied). This is also equal to
$A(\gamma B)$. (Note that the products $\gamma A$ and $\gamma B$ are defined as scalar-matrix
products, but in general, unless $A$ and $B$ have one row, not as matrix-matrix
products.)

• Distributivity with addition. Matrix multiplication distributes across matrix
addition: $A(B + C) = AB + AC$ and $(A + B)C = AC + BC$. On the right-hand
sides of these equations we use the higher precedence of matrix multiplication
over addition, so, for example, $AC + BC$ is interpreted as $(AC) + (BC)$.

• Transpose of product. The transpose of a product is the product of the
transposes, but in the opposite order: $(AB)^T = B^T A^T$.


From these properties we can derive others. For example, if A, B, C, and D are 
square matrices of the same size, we have the identity 


(A+ B)(C + D) = AC + AD + BC + BD. 


This is the same as the usual formula for expanding a product of sums of scalars; 
but with matrices, we must be careful to preserve the order of the products. 


Inner product and matrix-vector products. As an exercise on matrix-vector
products and inner products, one can verify that if $A$ is $m \times n$, $x$ is an $n$-vector, and $y$
is an $m$-vector, then

$$
y^T (Ax) = (y^T A) x = (A^T y)^T x,
$$

i.e., the inner product of $y$ and $Ax$ is equal to the inner product of $x$ and $A^T y$. (Note
that when $m \ne n$, these inner products involve vectors with different dimensions.)
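These identities are easy to spot-check numerically. The sketch below uses randomly generated matrices of compatible (and otherwise arbitrary) sizes; a passing check on random data is of course an illustration, not a proof. The dictionary name `checks` and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
B2 = rng.standard_normal((4, 2))        # second matrix with the same size as B
C = rng.standard_normal((2, 5))
x = rng.standard_normal(4)
y = rng.standard_normal(3)
gamma = 2.5

checks = {
    "associativity": np.allclose((A @ B) @ C, A @ (B @ C)),
    "scalar associativity": np.allclose(gamma * (A @ B), (gamma * A) @ B),
    "distributivity": np.allclose(A @ (B + B2), A @ B + A @ B2),
    "transpose of product": np.allclose((A @ B).T, B.T @ A.T),
    "inner product identity": np.isclose(y @ (A @ x), (A.T @ y) @ x),
}
for name, ok in checks.items():
    print(name, ok)
```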


Products of block matrices. Suppose $A$ is a block matrix with $m \times p$ block entries
$A_{ij}$, and $B$ is a block matrix with $p \times n$ block entries $B_{ij}$, and for each $k = 1, \ldots, p$,
the matrix product $A_{ik} B_{kj}$ makes sense, i.e., the number of columns of $A_{ik}$ equals
the number of rows of $B_{kj}$. (In this case we say that the block matrices conform or
are compatible.) Then $C = AB$ can be expressed as the $m \times n$ block matrix with
entries $C_{ij}$, given by the formula (10.1). For example, we have

$$
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} E & F \\ G & H \end{bmatrix}
= \begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix},
$$

for any matrices $A, B, \ldots, H$ for which the matrix products above make sense. This
formula is the same as the formula for multiplying two 2 x 2 matrices (i.e., with 
scalar entries); but when the entries of the matrix are themselves matrices (as in 
the block matrix above), we must be careful to preserve the multiplication order. 


Column interpretation of matrix-matrix product. We can derive some additional 
insight into matrix multiplication by interpreting the operation in terms of the 
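columns of the second matrix. Consider the matrix product of an $m \times p$ matrix
$A$ and a $p \times n$ matrix $B$, and denote the columns of $B$ by $b_k$. Using block-matrix
notation, we can write the product $AB$ as

$$
AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}
= \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_n \end{bmatrix}.
$$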


Thus, the columns of AB are the matrix-vector products of A and the columns 
of B. The product AB can be interpreted as the matrix obtained by ‘applying’ A 
to each of the columns of B. 


Multiple sets of linear equations. We can use the column interpretation of matrix
multiplication to express $k$ sets of linear equations with the same $m \times n$ coefficient
matrix $A$,

$$
Ax_i = b_i, \quad i = 1, \ldots, k,
$$

in the compact form

$$
AX = B,
$$

where $X = [x_1 \; \cdots \; x_k]$ and $B = [b_1 \; \cdots \; b_k]$. The matrix equation $AX = B$ is
sometimes called a linear equation with matrix right-hand side, since it looks like
$Ax = b$, but $X$ (the variable) is now an $n \times k$ matrix and $B$ (the right-hand side) is
an $m \times k$ matrix, instead of an $n$-vector and an $m$-vector.
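As an illustration, NumPy's `np.linalg.solve` accepts a matrix right-hand side and solves all $k$ systems in one call. The sketch below makes an assumption that the text does not: $A$ is square and invertible, so each system has a unique solution. The numerical values are made up.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])          # assumed square and invertible
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 5.0]])               # two right-hand sides b_1, b_2 as columns

X = np.linalg.solve(A, B)                # solves AX = B in a single call

# Same result, one right-hand side at a time, then stacked as columns.
X_cols = np.column_stack([np.linalg.solve(A, B[:, i]) for i in range(B.shape[1])])

print(np.allclose(A @ X, B))
print(np.allclose(X, X_cols))
```

Column $i$ of `X` is the solution $x_i$ of $Ax_i = b_i$, which is exactly the column interpretation of the product $AX$.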


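Row interpretation of matrix-matrix product. We can give an analogous row
interpretation of the product $AB$, by partitioning $A$ and $AB$ as block matrices
with row vector blocks. Let $a_1^T, \ldots, a_m^T$ be the rows of $A$. Then we have

$$
AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} B
= \begin{bmatrix} a_1^T B \\ a_2^T B \\ \vdots \\ a_m^T B \end{bmatrix}
= \begin{bmatrix} (B^T a_1)^T \\ (B^T a_2)^T \\ \vdots \\ (B^T a_m)^T \end{bmatrix}.
$$

This shows that the rows of $AB$ are obtained by applying $B^T$ to the transposed
row vectors $a_k$ of $A$, and transposing the result.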

Inner product representation. From the definition of the $i, j$ element of $AB$
in (10.1), we also see that the elements of $AB$ are the inner products of the rows
of $A$ with the columns of $B$:

$$
AB = \begin{bmatrix}
a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_n \\
a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_n \\
\vdots & \vdots & & \vdots \\
a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_n
\end{bmatrix},
$$

where $a_i^T$ are the rows of $A$ and $b_j$ are the columns of $B$. Thus we can interpret
the matrix-matrix product as the $mn$ inner products $a_i^T b_j$ arranged in an $m \times n$
matrix.


Gram matrix. For an $m \times n$ matrix $A$, with columns $a_1, \ldots, a_n$, the matrix product
$G = A^T A$ is called the Gram matrix associated with the set of $m$-vectors $a_1, \ldots, a_n$.
(It is named after the mathematician Jørgen Pedersen Gram.) From the inner
product interpretation above, the Gram matrix can be expressed as

$$
G = A^T A = \begin{bmatrix}
a_1^T a_1 & a_1^T a_2 & \cdots & a_1^T a_n \\
a_2^T a_1 & a_2^T a_2 & \cdots & a_2^T a_n \\
\vdots & \vdots & \ddots & \vdots \\
a_n^T a_1 & a_n^T a_2 & \cdots & a_n^T a_n
\end{bmatrix}.
$$

The entries of the Gram matrix $G$ give all inner products of pairs of columns of $A$.
Note that a Gram matrix is symmetric, since $a_i^T a_j = a_j^T a_i$. This can also be seen
using the transpose-of-product rule:

$$
G^T = (A^T A)^T = (A^T)(A^T)^T = A^T A = G.
$$


The Gram matrix will play an important role later in this book. 
As an example, suppose the m x n matrix A gives the membership of m items 
in n groups, with entries 


$$
A_{ij} = \begin{cases}
1 & \text{item } i \text{ is in group } j \\
0 & \text{item } i \text{ is not in group } j.
\end{cases}
$$

(So the $j$th column of $A$ gives the membership in the $j$th group, and the $i$th row
gives the groups that item $i$ is in.) In this case the Gram matrix $G$ has a nice
interpretation: $G_{ij}$ is the number of items that are in both groups $i$ and $j$, and $G_{ii}$
is the number of items in group $i$.
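A small made-up example shows this interpretation at work. The membership data below (5 items, 3 groups) is hypothetical and chosen only so the counts can be checked by eye.

```python
import numpy as np

# Rows are items, columns are groups; entry 1 means the item belongs to the group.
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1],
              [0, 0, 1]])

G = A.T @ A                         # Gram matrix of the columns of A
print(G)
print(np.array_equal(G, G.T))       # a Gram matrix is symmetric
print(np.diag(G))                   # group sizes
```

Here each group has three members, groups 1 and 2 share two items (items 2 and 4), and groups 1 and 3 share only item 4, which is what the diagonal and off-diagonal entries of `G` report.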


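Outer product representation. If we express the $m \times p$ matrix $A$ in terms of its
columns $a_1, \ldots, a_p$ and the $p \times n$ matrix $B$ in terms of its rows $b_1^T, \ldots, b_p^T$,

$$
A = \begin{bmatrix} a_1 & \cdots & a_p \end{bmatrix}, \qquad
B = \begin{bmatrix} b_1^T \\ \vdots \\ b_p^T \end{bmatrix},
$$

then we can express the product matrix $AB$ as a sum of outer products:

$$
AB = a_1 b_1^T + \cdots + a_p b_p^T.
$$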

Complexity of matrix multiplication. The total number of flops required for a
matrix-matrix product $C = AB$ with $A$ of size $m \times p$ and $B$ of size $p \times n$ can be
found several ways. The product matrix $C$ has size $m \times n$, so there are $mn$ elements
to compute. The $i, j$ element of $C$ is the inner product of row $i$ of $A$ with column
$j$ of $B$. This is an inner product of vectors of length $p$ and requires $2p - 1$ flops.
Therefore the total is $mn(2p - 1)$ flops, which we approximate as $2mnp$ flops. The
order of computing the matrix-matrix product is $mnp$, the product of the three
dimensions involved.

In some special cases the complexity is less than $2mnp$ flops. As an example,
when we compute the $n \times n$ Gram matrix $G = B^T B$ we only need to compute the
entries in the upper (or lower) half of $G$, since $G$ is symmetric. This saves around
half the flops, so the complexity is around $pn^2$ flops. But the order is the same.


Complexity of sparse matrix multiplication. Multiplying sparse matrices can be 
done efficiently, since we don’t need to carry out any multiplications in which one 
or the other entry is zero. We start by analyzing the complexity of multiplying 
a sparse matrix with a non-sparse matrix. Suppose that $A$ is $m \times p$ and sparse,
and $B$ is $p \times n$, but not necessarily sparse. The inner product of the $i$th row $a_i^T$
of $A$ with the $j$th column of $B$ requires no more than $2\,\mathbf{nnz}(a_i^T)$ flops. Summing
over $i = 1, \ldots, m$ and $j = 1, \ldots, n$ we get $2\,\mathbf{nnz}(A)n$ flops. If $B$ is sparse, the
total number of flops is no more than $2\,\mathbf{nnz}(B)m$ flops. (Note that these formulas
agree with the one given above, $2mnp$, when the sparse matrices have all entries
nonzero.)

There is no simple formula for the complexity of multiplying two sparse
matrices, but it is certainly no more than $2\min\{\mathbf{nnz}(A)n, \mathbf{nnz}(B)m\}$ flops.


Complexity of matrix triple product. Consider the product of three matrices,

$$
D = ABC,
$$

with $A$ of size $m \times n$, $B$ of size $n \times p$, and $C$ of size $p \times q$. The matrix $D$ can be
computed in two ways, as $(AB)C$ and as $A(BC)$. In the first method we start with
$AB$ ($2mnp$ flops) and then form $D = (AB)C$ ($2mpq$ flops), for a total of $2mp(n + q)$
flops. In the second method we compute the product $BC$ ($2npq$ flops) and then
form $D = A(BC)$ ($2mnq$ flops), for a total of $2nq(m + p)$ flops.

You might guess that the total number of flops required is the same with the
two methods, but it turns out it is not. The first method is less expensive when
$2mp(n + q) < 2nq(m + p)$, i.e., when

$$
\frac{1}{n} + \frac{1}{q} < \frac{1}{m} + \frac{1}{p}.
$$

For example, if $m = p$ and $n = q$, the first method has a complexity proportional
to $m^2 n$, while the second method has complexity proportional to $mn^2$, and one
would prefer the first method when $m < n$.
As a more specific example, consider the product $ab^T c$, where $a, b, c$ are
$n$-vectors. If we first evaluate the outer product $ab^T$, the cost is $n^2$ flops, and we
need to store $n^2$ values. We then multiply the vector $c$ by this $n \times n$ matrix, which
costs $2n^2$ flops. The total cost is $3n^2$ flops. On the other hand if we first evaluate
the inner product $b^T c$, the cost is $2n$ flops, and we only need to store one number
(the result). Multiplying the vector $a$ by this number costs $n$ flops, so the total cost
is $3n$ flops. For $n$ large, there is a dramatic difference between $3n$ and $3n^2$ flops.
(The storage requirements are also dramatically different for the two methods of
evaluating $ab^T c$: one number versus $n^2$ numbers.)
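The two evaluation orders can be compared directly. In the sketch below the function names `flops_left_first` and `flops_right_first` are hypothetical; they simply encode the approximate counts $2mp(n+q)$ and $2nq(m+p)$ from the text. The vector length is an arbitrary choice.

```python
import numpy as np

def flops_left_first(m, n, p, q):
    """Approximate flop count for (AB)C with A m x n, B n x p, C p x q."""
    return 2 * m * p * (n + q)

def flops_right_first(m, n, p, q):
    """Approximate flop count for A(BC) with the same sizes."""
    return 2 * n * q * (m + p)

rng = np.random.default_rng(0)
n = 1000
a, b, c = rng.standard_normal((3, n))    # three random n-vectors

slow = np.outer(a, b) @ c                # forms the n x n matrix a b^T first
fast = a * (b @ c)                       # forms the scalar b^T c first

print(np.allclose(slow, fast))           # both orders give the same vector
print(flops_left_first(10, 1000, 10, 1000), flops_right_first(10, 1000, 10, 1000))
```

Both orders produce the same vector, but the first allocates an $n \times n$ array along the way while the second needs only one extra number.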


Composition of linear functions 


Matrix-matrix products and composition. Suppose $A$ is an $m \times p$ matrix and $B$
is $p \times n$. We can associate with these matrices two linear functions $f : \mathbf{R}^p \to \mathbf{R}^m$
and $g : \mathbf{R}^n \to \mathbf{R}^p$, defined as $f(x) = Ax$ and $g(x) = Bx$. The composition of the
two functions is the function $h : \mathbf{R}^n \to \mathbf{R}^m$ with

$$
h(x) = f(g(x)) = A(Bx) = (AB)x.
$$

In words: To find $h(x)$, we first apply the function $g$, to obtain the partial result
$g(x)$ (which is a $p$-vector); then we apply the function $f$ to this result, to obtain
$h(x)$ (which is an $m$-vector). In the formula $h(x) = f(g(x))$, $f$ appears to the left
of $g$; but when we evaluate $h(x)$, we apply $g$ first. The composition $h$ is evidently
a linear function, that can be written as $h(x) = Cx$ with $C = AB$.

Using this interpretation of matrix multiplication as composition of linear
functions, it is easy to understand why in general $AB \ne BA$, even when the
dimensions are compatible. Evaluating the function $h(x) = ABx$ means we first evaluate
$y = Bx$, and then $z = Ay$. Evaluating the function $BAx$ means we first evaluate
$y = Ax$, and then $z = By$. In general, the order matters. As an example, take the
$2 \times 2$ matrices

$$
A = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},
$$

for which

$$
AB = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \qquad
BA = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}.
$$

The mapping $f(x) = Ax = (-x_1, x_2)$ changes the sign of the first element of the
vector $x$. The mapping $g(x) = Bx = (x_2, x_1)$ reverses the order of two elements
of $x$. If we evaluate $f(g(x)) = ABx = (-x_2, x_1)$, we first reverse the order, and
then change the sign of the first element. This result is obviously different from
$g(f(x)) = BAx = (x_2, -x_1)$, obtained by changing the sign of the first element,
and then reversing the order of the elements.


Second difference matrix. As a more interesting example of composition of linear 
functions, consider the (n — 1) x n difference matrix D, defined in (6.5). (We use 
the subscript n here to denote size of D.) Let Dn—ı denote the (n — 2) x (n — 1) 
difference matrix. Their product Dn—-1Dn is called the second difference matriz, 
and sometimes denoted A. 
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We can interpret A in terms of composition of linear functions. Multiplying an 
n-vector x by Dn yields the (n — 1)-vector of consecutive differences of the entries: 


Dng = (£2 — £1,..-,2n — Ln-1)- 


Multiplying this vector by D,_1 gives the (n — 2)-vector of consecutive differences 
of consecutive differences (or second differences) of x: 


Dn-1Dyx = (£1 — 2x2 + £3, Lo — 2434+ T4, ..., En—2 — 2Wn-1 + Tn). 


The (n — 2) x n product matrix A = Dn-1Dn is the matrix associated with the 
second difference function. 
For the case n = 5, $\Delta = D_4 D_5$ has the form 

$$
\begin{bmatrix} 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \end{bmatrix}
=
\begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix}
\begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}.
$$

The left-hand matrix Δ is associated with the second difference linear function 
that maps 5-vectors into 3-vectors. The middle matrix $D_4$ is associated with the 
difference function that maps 4-vectors into 3-vectors. The right-hand matrix $D_5$ 
is associated with the difference function that maps 5-vectors into 4-vectors. 
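
As a small numerical illustration, the sketch below builds the difference matrices for n = 5 and forms their product. It assumes NumPy; the helper name `difference_matrix` and the test vector are hypothetical choices made for this example only.

```python
import numpy as np

def difference_matrix(n):
    """Return the (n-1) x n difference matrix D_n, so (D_n x)_i = x_{i+1} - x_i."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i] = -1.0
        D[i, i + 1] = 1.0
    return D

n = 5
D5 = difference_matrix(n)        # maps 5-vectors to 4-vectors
D4 = difference_matrix(n - 1)    # maps 4-vectors to 3-vectors
Delta = D4 @ D5                  # 3 x 5 second difference matrix

# Test vector of squares (illustrative): its second differences are all 2.
x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
second_diff = Delta @ x
print(Delta)
print(second_diff)
```

Applying `D5` and then `D4` to x gives the same result as applying `Delta` once, which is the composition interpretation of the product.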


Composition of affine functions. The composition of affine functions is an affine 
function. Suppose $f : \mathbf{R}^p \to \mathbf{R}^m$ is the affine function given by f(x) = Ax + b, and 
$g : \mathbf{R}^n \to \mathbf{R}^p$ is the affine function given by g(x) = Cx + d. The composition h is 
given by 

$$
h(x) = f(g(x)) = A(Cx + d) + b = (AC)x + (Ad + b) = \tilde A x + \tilde b,
$$

where $\tilde A = AC$, $\tilde b = Ad + b$. 
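
The formula for the composed affine function can be checked on random data. This is a sketch under stated assumptions: NumPy is used, and the dimensions, random seed, and variable names (`A_tilde`, `b_tilde`) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 2, 3, 4                      # illustrative dimensions

A = rng.standard_normal((m, p))        # f(y) = A y + b maps p-vectors to m-vectors
b = rng.standard_normal(m)
C = rng.standard_normal((p, n))        # g(x) = C x + d maps n-vectors to p-vectors
d = rng.standard_normal(p)

# Parameters of the composition h(x) = f(g(x)).
A_tilde = A @ C                        # m x n matrix
b_tilde = A @ d + b                    # m-vector

x = rng.standard_normal(n)
h_direct = A @ (C @ x + d) + b         # evaluate g, then f
h_affine = A_tilde @ x + b_tilde       # evaluate the composed affine function
print(np.allclose(h_direct, h_affine))
```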


Chain rule of differentiation. Let $f : \mathbf{R}^p \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^p$ be 
differentiable functions. The composition of f and g is defined as the function 
$h : \mathbf{R}^n \to \mathbf{R}^m$ with 

$$
h(x) = f(g(x)) = f(g_1(x), \ldots, g_p(x)).
$$

The function h is differentiable and its partial derivatives follow from those of f 
and g via the chain rule: 

$$
\frac{\partial h_i}{\partial x_j}(z)
= \frac{\partial f_i}{\partial y_1}(g(z)) \frac{\partial g_1}{\partial x_j}(z)
+ \cdots
+ \frac{\partial f_i}{\partial y_p}(g(z)) \frac{\partial g_p}{\partial x_j}(z)
$$

for i = 1, ..., m and j = 1, ..., n. This relation can be expressed concisely as a 
matrix-matrix product: The derivative matrix of h at z is the product 

$$
Dh(z) = Df(g(z)) \, Dg(z)
$$

of the derivative matrix of f at g(z) and the derivative matrix of g at z. This 
compact matrix formula generalizes the chain rule for scalar-valued functions of a 
single variable, i.e., h′(z) = f′(g(z))g′(z). 

The first order Taylor approximation of h at z can therefore be written as 

$$
\begin{aligned}
\hat h(x) &= h(z) + Dh(z)(x - z) \\
&= f(g(z)) + Df(g(z)) Dg(z)(x - z).
\end{aligned}
$$

The same result can be interpreted as a composition of two affine functions, the 
first order Taylor approximation of f at g(z), 

$$
\hat f(y) = f(g(z)) + Df(g(z))(y - g(z)),
$$

and the first order Taylor approximation of g at z, 

$$
\hat g(x) = g(z) + Dg(z)(x - z).
$$

The composition of these two affine functions is 

$$
\begin{aligned}
\hat f(\hat g(x)) &= \hat f(g(z) + Dg(z)(x - z)) \\
&= f(g(z)) + Df(g(z))\bigl(g(z) + Dg(z)(x - z) - g(z)\bigr) \\
&= f(g(z)) + Df(g(z)) Dg(z)(x - z),
\end{aligned}
$$

which is equal to $\hat h(x)$. 
When f is a scalar-valued function (m = 1), the derivative matrices Dh(z) and 
Df(g(z)) are the transposes of the gradients, and we write the chain rule as 

$$
\nabla h(z) = Dg(z)^T \nabla f(g(z)).
$$

In particular, if g(x) = Ax + b is affine, then the gradient of h(x) = f(g(x)) = 
f(Ax + b) is given by $\nabla h(z) = A^T \nabla f(Az + b)$. 
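
The gradient formula for an affine inner function can be compared against finite differences. The sketch below assumes NumPy and picks $f(y) = \|y\|^2$ (so $\nabla f(y) = 2y$) purely for illustration; the random data, step size, and function names are likewise assumptions of the example.

```python
import numpy as np

def f(y):
    """Illustrative scalar-valued function f(y) = ||y||^2."""
    return float(y @ y)

def grad_f(y):
    """Gradient of f(y) = ||y||^2, which is 2y."""
    return 2.0 * y

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
z = rng.standard_normal(4)

def h(x):
    """Composition h(x) = f(Ax + b)."""
    return f(A @ x + b)

# Chain rule: gradient of h at z is A^T times the gradient of f at Az + b.
grad_chain = A.T @ grad_f(A @ z + b)

# Central finite differences along each unit vector, as an independent check.
eps = 1e-6
grad_fd = np.array([(h(z + eps * e) - h(z - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad_chain, grad_fd, atol=1e-5))
```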


Linear dynamical system with state feedback. We consider a time-invariant 
linear dynamical system with n-vector state $x_t$ and m-vector input $u_t$, with dynamics 

$$
x_{t+1} = A x_t + B u_t, \quad t = 1, 2, \ldots.
$$

Here we think of the input $u_t$ as something we can manipulate, e.g., the control 
surface deflections for an airplane or the amount of material we order or move in a 
supply chain. In state feedback control the state $x_t$ is measured, and the input $u_t$ 
is a linear function of the state, expressed as 

$$
u_t = K x_t,
$$


where K is the m x n state-feedback gain matrix. The term feedback refers to 
the idea that the state is measured, and then (after multiplying by K) fed back 
into the system, via the input. This leads to a loop, where the state affects the 
input, and the input affects the (next) state. State feedback is very widely used 
in many applications. (In §17.2.3 we will see methods for choosing or designing an 
appropriate state feedback matrix.) 

With state feedback, we have 

$$
x_{t+1} = A x_t + B u_t = A x_t + B(K x_t) = (A + BK) x_t, \quad t = 1, 2, \ldots.
$$

This recursion is called the closed-loop system. The matrix A + BK is called the 
closed-loop dynamics matrix. (In this context, the recursion $x_{t+1} = A x_t$ is called 
the open-loop system. It gives the dynamics when $u_t = 0$.) 
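
The equivalence between running the feedback loop step by step and iterating the closed-loop dynamics matrix can be illustrated as follows. This is a sketch assuming NumPy; the matrices A, B, and the gain K are hypothetical values chosen for the example (K is not designed by any method from the text).

```python
import numpy as np

# Hypothetical system with n = 2 states and m = 1 input.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-0.5, -1.0]])      # hypothetical state-feedback gain (m x n)

A_cl = A + B @ K                  # closed-loop dynamics matrix

T = 20
x_loop = np.array([1.0, 0.0])     # state simulated with explicit feedback
x_closed = x_loop.copy()          # state simulated with A + BK directly
for t in range(T):
    u = K @ x_loop                      # measure the state, compute u_t = K x_t
    x_loop = A @ x_loop + B @ u         # x_{t+1} = A x_t + B u_t
    x_closed = A_cl @ x_closed          # x_{t+1} = (A + BK) x_t

print(np.allclose(x_loop, x_closed))
```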


Matrix power 


It makes sense to multiply a square matrix A by itself to form AA. We refer to 
this matrix as $A^2$. Similarly, if k is a positive integer, then k copies of A multiplied 
together is denoted $A^k$. If k and l are positive integers, and A is square, then 
$A^k A^l = A^{k+l}$ and $(A^k)^l = A^{kl}$. By convention we take $A^0 = I$, which makes the 
formulas above hold for all nonnegative integer values of k and l. 
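
These identities are easy to confirm on a small example. The sketch below assumes NumPy and uses its `numpy.linalg.matrix_power` function; the matrix A and the exponents are arbitrary illustrative choices.

```python
import numpy as np
from numpy.linalg import matrix_power

A = np.array([[1, 1],
              [0, 2]])            # arbitrary square integer matrix
k, l = 2, 3

# A^k A^l = A^(k+l)
sum_rule_holds = np.array_equal(matrix_power(A, k) @ matrix_power(A, l),
                                matrix_power(A, k + l))
# (A^k)^l = A^(kl)
product_rule_holds = np.array_equal(matrix_power(matrix_power(A, k), l),
                                    matrix_power(A, k * l))
# Convention A^0 = I
A0 = matrix_power(A, 0)
print(sum_rule_holds, product_rule_holds, A0)
```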

We should mention one ambiguity in matrix power notation that occasionally 
arises. When A is a square matrix and T is a nonnegative integer, $A^T$ can mean 
either the transpose of the matrix A or its Tth power. Usually which is meant is 
clear from the context, or the author explicitly states which meaning is intended. 
To avoid this ambiguity, some authors use a different symbol for the transpose, 
such as $A^{\mathsf{T}}$ (with the superscript in roman font) or A′, or avoid referring to the 
Tth power of a matrix. When A is not square there is no ambiguity, since $A^T$ can 
only be the transpose in this case. 


Other matrix powers. Matrix powers $A^k$ with k a negative integer will be 
discussed in §11.2. Non-integer powers, such as $A^{1/2}$ (the matrix square root), need 
not make sense, or can be ambiguous, unless certain conditions on A hold. This is 
an advanced topic in linear algebra that we will not pursue in this book. 


Paths in a directed graph. Suppose A is the n × n adjacency matrix of a directed 
graph with n vertices: 

$$
A_{ij} = \begin{cases} 1 & \text{there is an edge from vertex } j \text{ to vertex } i \\ 0 & \text{otherwise} \end{cases}
$$

(see page 112). A path of length ℓ is a sequence of ℓ + 1 vertices, with an edge 
from the first to the second vertex, an edge from the second to third vertex, and 
so on. We say the path goes from the first vertex to the last one. An edge can be 
considered a path of length one. By convention, every vertex has a path of length 
zero (from the vertex to itself). 

The elements of the matrix powers $A^\ell$ have a simple meaning in terms of paths 
in the graph. First examine the expression for the i, j element of the square of A: 

$$
(A^2)_{ij} = \sum_{k=1}^{n} A_{ik} A_{kj}.
$$




Figure 10.1 Directed graph. 


Each term in the sum is 0 or 1, and equal to one only if there is an edge from vertex 
j to vertex k and an edge from vertex k to vertex i, i.e., a path of length exactly 
two from vertex j to vertex i via vertex k. By summing over all k, we obtain the 
total number of paths of length two from j to i. 

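The adjacency matrix A for the graph in figure 10.1, for example, and its square 
are given by 

$$
A = \begin{bmatrix}
0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}, \qquad
A^2 = \begin{bmatrix}
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 & 2 \\
1 & 0 & 1 & 2 & 1 \\
0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0
\end{bmatrix}.
$$

We can verify there is exactly one path of length two from vertex 1 to itself, i.e., 
the path (1, 2, 1), and one path of length two from vertex 3 to vertex 1, i.e., the 
path (3, 2, 1). There are two paths of length two from vertex 4 to vertex 3, (4, 3, 3) 
and (4, 5, 3), so $(A^2)_{34} = 2$. 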

The property extends to higher powers of A. If ℓ is a positive integer, then 
the i, j element of $A^\ell$ is the number of paths of length ℓ from vertex j to vertex i. 
This can be proved by induction on ℓ. We have already shown the result for ℓ = 2. 
Assume that it is true that the elements of $A^\ell$ give the paths of length ℓ between 
the different vertices. Consider the expression for the i, j element of $A^{\ell+1}$: 

$$
(A^{\ell+1})_{ij} = \sum_{k=1}^{n} A_{ik} (A^\ell)_{kj}.
$$

The kth term in the sum is equal to the number of paths of length ℓ from j to k if 
there is an edge from k to i, and is equal to zero otherwise. Therefore it is equal 
to the number of paths of length ℓ + 1 from j to i that end with the edge (k, i), 
i.e., of the form (j, ..., k, i). By summing over all k we obtain the total number of 
paths of length ℓ + 1 from vertex j to i. 

Figure 10.2 Contribution factor per age in 2010 to the total population in 
2020. The value for age i − 1 is the ith component of the row vector $\mathbf{1}^T A^{10}$. 


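This can be verified in the example. The third power of A is 

$$
A^3 = \begin{bmatrix}
1 & 1 & 1 & 1 & 2 \\
2 & 0 & 2 & 3 & 1 \\
2 & 1 & 1 & 2 & 2 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 1
\end{bmatrix}.
$$

The $(A^3)_{24} = 3$ paths of length three from vertex 4 to vertex 2 are (4, 3, 3, 2), 
(4, 5, 3, 2), (4, 5, 1, 2). 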
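
The path counts in this example can be reproduced with a few lines of code. The sketch below assumes NumPy; the helper `num_paths` and its 1-based vertex arguments are hypothetical conveniences introduced here, not notation from the text.

```python
import numpy as np
from numpy.linalg import matrix_power

# Adjacency matrix of the graph in figure 10.1:
# A[i, j] = 1 if there is an edge from vertex j+1 to vertex i+1.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0]])

A2 = matrix_power(A, 2)
A3 = matrix_power(A, 3)

def num_paths(A, length, start, end):
    """Number of paths of the given length from vertex `start` to vertex `end`.

    Vertices are numbered from 1, as in the text; the count is the
    (end, start) entry of the matrix power.
    """
    return int(matrix_power(A, length)[end - 1, start - 1])

print(num_paths(A, 2, 4, 3))   # two paths: (4,3,3) and (4,5,3)
print(num_paths(A, 3, 4, 2))   # three paths: (4,3,3,2), (4,5,3,2), (4,5,1,2)
```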


Linear dynamical system. Consider a time-invariant linear dynamical system, 
described by $x_{t+1} = A x_t$. We have $x_{t+2} = A x_{t+1} = A(A x_t) = A^2 x_t$. Continuing 
this argument, we have 

$$
x_{t+\ell} = A^\ell x_t,
$$

for ℓ = 1, 2, .... In a linear dynamical system, we can interpret $A^\ell$ as the matrix 
that propagates the state forward ℓ time steps. 

For example, in a population dynamics model, $A^\ell$ is the matrix that maps the 
current population distribution into the population distribution ℓ periods in the 
future, taking into account births, deaths, and the births and deaths of children, 
and so on. The total population ℓ periods in the future is given by $\mathbf{1}^T(A^\ell x_t)$, which 
we can write as $(\mathbf{1}^T A^\ell) x_t$. The row vector $\mathbf{1}^T A^\ell$ has an interesting interpretation: 
Its ith entry is the contribution to the total population in ℓ periods due to each 
person with current age i − 1. It is plotted in figure 10.2 for the US data given 
in §9.2. 

Matrix powers also come up in the analysis of a time-invariant linear dynamical 
system with an input. We have 

$$
x_{t+2} = A x_{t+1} + B u_{t+1} = A(A x_t + B u_t) + B u_{t+1} = A^2 x_t + AB u_t + B u_{t+1}.
$$

Iterating this over ℓ periods we obtain 

$$
x_{t+\ell} = A^\ell x_t + A^{\ell-1} B u_t + A^{\ell-2} B u_{t+1} + \cdots + B u_{t+\ell-1}. \qquad (10.2)
$$


(The first term agrees with the formula for $x_{t+\ell}$ with no input.) The other terms 
are readily interpreted. For j = 0, ..., ℓ − 1, the term $A^j B u_{t+\ell-1-j}$ is the 
contribution to the state $x_{t+\ell}$ due to the input at time t + ℓ − 1 − j. 
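
Formula (10.2) can be checked against a direct simulation of the recursion. This is a minimal sketch assuming NumPy; the dimensions, random data, and the convention that `u[s]` stands for the input at time t + s are choices made for this example.

```python
import numpy as np
from numpy.linalg import matrix_power

rng = np.random.default_rng(2)
n, m, ell = 3, 2, 5                     # illustrative sizes and horizon
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
x_t = rng.standard_normal(n)            # state at time t
u = rng.standard_normal((ell, m))       # u[s] plays the role of u_{t+s}

# Direct simulation of x_{t+1} = A x_t + B u_t over ell steps.
x_sim = x_t.copy()
for s in range(ell):
    x_sim = A @ x_sim + B @ u[s]

# Closed-form expression (10.2): A^ell x_t plus a sum of A^(ell-1-s) B u_{t+s}.
x_formula = matrix_power(A, ell) @ x_t
for s in range(ell):
    x_formula = x_formula + matrix_power(A, ell - 1 - s) @ B @ u[s]

print(np.allclose(x_sim, x_formula))
```

The power applied to the input at time t + s is ℓ − 1 − s, so the most recent input enters through B alone and the oldest input is propagated through ℓ − 1 steps.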


QR factorization 


Matrices with orthonormal columns. As an application of Gram matrices, we 
can express the condition that the n-vectors $a_1, \ldots, a_k$ are orthonormal in a simple 
way using matrix notation: 

$$
A^T A = I,
$$

where A is the n × k matrix with columns $a_1, \ldots, a_k$. There is no standard term for 
a matrix whose columns are orthonormal: We refer to a matrix whose columns are 
orthonormal as ‘a matrix whose columns are orthonormal’. But a square matrix 
that satisfies $A^T A = I$ is called orthogonal; its columns are an orthonormal basis. 
Orthogonal matrices have many uses, and arise in many applications. 

We have already encountered some orthogonal matrices, including identity 
matrices, 2-D reflections and rotations (page 129), and permutation matrices (page 132). 


Norm, inner product, and angle properties. Suppose the columns of the m × n 
matrix A are orthonormal, and x and y are any n-vectors. We let $f : \mathbf{R}^n \to \mathbf{R}^m$ 
be the function that maps x to Ax. Then we have the following: 

• $\|Ax\| = \|x\|$. That is, f is norm preserving. 
• $(Ax)^T(Ay) = x^T y$. f preserves the inner product between vectors. 
• $\angle(Ax, Ay) = \angle(x, y)$. f also preserves angles between vectors. 

Note that in each of the three equations above, the vectors appearing in the left- 
and right-hand sides have different dimensions, m on the left and n on the right. 

We can verify these properties using simple matrix properties. We start with 
the second statement, that multiplication by A preserves the inner product. We 
have 

$$
\begin{aligned}
(Ax)^T(Ay) &= (x^T A^T)(Ay) \\
&= x^T (A^T A) y \\
&= x^T I y \\
&= x^T y.
\end{aligned}
$$

In the first line, we use the transpose-of-product rule; in the second, we re-associate 
a product of 4 matrices (considering the row vector $x^T$ and column vector y as 
matrices); in the third line, we use $A^T A = I$; and in the fourth line, we use Iy = y. 

From the second property we can derive the first one: By taking y = x we get 
$(Ax)^T(Ax) = x^T x$; taking the square root of each side gives $\|Ax\| = \|x\|$. The third 
property, angle preservation, follows from the first two, since 

$$
\angle(Ax, Ay) = \arccos\left(\frac{(Ax)^T(Ay)}{\|Ax\| \, \|Ay\|}\right)
= \arccos\left(\frac{x^T y}{\|x\| \, \|y\|}\right) = \angle(x, y).
$$
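
The three properties can be observed numerically on a tall matrix with orthonormal columns. The sketch below assumes NumPy; it obtains such a matrix from `numpy.linalg.qr` applied to random data, which is simply a convenient way to generate an example. The helper `angle` is defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3                               # illustrative sizes, m > n

# In its default (reduced) mode, np.linalg.qr returns an m x n factor
# with orthonormal columns; we use it as the matrix A of the text.
A, _ = np.linalg.qr(rng.standard_normal((m, n)))

x = rng.standard_normal(n)
y = rng.standard_normal(n)

def angle(u, v):
    """Angle between two nonzero vectors, in radians."""
    return np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

gram_is_identity = np.allclose(A.T @ A, np.eye(n))
norm_preserved = np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x))
inner_preserved = np.isclose((A @ x) @ (A @ y), x @ y)
angle_preserved = np.isclose(angle(A @ x, A @ y), angle(x, y))
print(gram_is_identity, norm_preserved, inner_preserved, angle_preserved)
```

Note that Ax and Ay are 6-vectors while x and y are 3-vectors, matching the remark that the two sides of each equation have different dimensions.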


QR factorization. We can express the result of the Gram-Schmidt algorithm 
described in §5.4 in a compact form using matrices. Let A be an n × k matrix 
with linearly independent columns $a_1, \ldots, a_k$. By the independence-dimension 
inequality, A is tall or square. Let Q be the n × k matrix with columns $q_1, \ldots, q_k$, 
the orthonormal vectors produced by the Gram-Schmidt algorithm applied to the 
n-vectors $a_1, \ldots, a_k$. Orthonormality of $q_1, \ldots, q_k$ is expressed in matrix form as 
$Q^T Q = I$. We express the equation relating $a_i$ and $q_i$, 

$$
a_i = (q_1^T a_i) q_1 + \cdots + (q_{i-1}^T a_i) q_{i-1} + \|\tilde q_i\| q_i,
$$

where $\tilde q_i$ is the vector obtained in the first step of the Gram-Schmidt algorithm, as 

$$
a_i = R_{1i} q_1 + \cdots + R_{ii} q_i,
$$

where $R_{ij} = q_i^T a_j$ for i < j and $R_{ii} = \|\tilde q_i\|$. Defining $R_{ij} = 0$ for i > j, we can 
write the equations above in compact matrix form as 

$$
A = QR.
$$


This is called the QR factorization of A, since it expresses the matrix A as a 
product of two matrices, Q and R. The n x k matrix Q has orthonormal columns, 
and the k x k matrix R is upper triangular, with positive diagonal elements. If 
A is square, with linearly independent columns, then Q is orthogonal and the QR 
factorization expresses A as a product of two square matrices. 

The attributes of the matrices Q and R in the QR factorization come directly 
from the Gram-Schmidt algorithm. The equation $Q^T Q = I$ follows from the 
orthonormality of the vectors $q_1, \ldots, q_k$. The matrix R is upper triangular because 
each vector $a_i$ is a linear combination of $q_1, \ldots, q_i$. 

The Gram-Schmidt algorithm is not the only algorithm for QR factorization. 
Several other QR factorization algorithms exist that are more reliable in the 
presence of round-off errors. (These QR factorization methods may also change the 
order in which the columns of A are processed.) 
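
To make the construction of Q and R concrete, here is a minimal sketch of QR factorization via the Gram-Schmidt algorithm, assuming NumPy. The function name `gram_schmidt_qr`, the tolerance `tol` used to detect dependent columns, and the test matrix are assumptions of this example; as noted above, production codes typically rely on other, more reliable algorithms.

```python
import numpy as np

def gram_schmidt_qr(A, tol=1e-12):
    """QR factorization of a matrix with linearly independent columns.

    Returns Q (n x k, orthonormal columns) and R (k x k, upper triangular
    with positive diagonal) such that A = Q R. The tolerance is an
    illustrative choice for detecting (near) linear dependence.
    """
    A = np.asarray(A, dtype=float)
    n, k = A.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for i in range(k):
        # R_{ji} = q_j^T a_i for j < i (coefficients on earlier q vectors).
        R[:i, i] = Q[:, :i].T @ A[:, i]
        # Orthogonalization step: subtract components along q_1, ..., q_{i-1}.
        q_tilde = A[:, i] - Q[:, :i] @ R[:i, i]
        # R_{ii} = norm of the orthogonalized vector.
        R[i, i] = np.linalg.norm(q_tilde)
        if R[i, i] <= tol:
            raise ValueError("columns appear to be linearly dependent")
        # Normalization step.
        Q[:, i] = q_tilde / R[i, i]
    return Q, R

# Small test matrix with linearly independent columns (illustrative).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
```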


Sparse QR factorization. There are algorithms for QR factorization that 
efficiently handle the case when the matrix A is sparse. In this case the matrix Q is 
stored in a special format that requires much less memory than if it were stored 
as a generic n × k matrix, i.e., nk numbers. The flop count for these sparse QR 
factorizations is also much smaller than $2nk^2$. 

Exercises 

10.1 Scalar-row-vector multiplication. Suppose a is a number and $x = [x_1 \; \cdots \; x_n]$ is an 
n-row-vector. The scalar-row-vector product ax is the n-row-vector $[ax_1 \; \cdots \; ax_n]$. Is this a 
special case of matrix-matrix multiplication? That is, can you interpret scalar-row-vector 
multiplication as matrix multiplication? (Recall that scalar-vector multiplication, with 
the scalar on the left, is not a special case of matrix-matrix multiplication; see page 177.) 


10.2 Ones matrix. There is no special notation for an m × n matrix all of whose entries are 
one. Give a simple expression for this matrix in terms of matrix multiplication, transpose, 
and the ones vectors $\mathbf{1}_m$, $\mathbf{1}_n$ (where the subscripts denote the dimension). 


10.3 Matrix sizes. Suppose A, B, and C are matrices that satisfy $A + BB^T = C$. Determine 
which of the following statements are necessarily true. (There may be more than one true 
statement.) 


(a) A is square. 
(b) A and B have the same dimensions. 
(c) A, B, and C have the same number of rows. 


(d) B is a tall matrix. 


10.4 Block matrix notation. Consider the block matrix 

$$
A = \begin{bmatrix} I & B & 0 \\ B^T & 0 & 0 \\ 0 & 0 & BB^T \end{bmatrix},
$$

where B is 10 × 5. What are the dimensions of the four zero matrices and the identity 
matrix in the definition of A? What are the dimensions of A? 


10.5 When is the outer product symmetric? Let a and b be n-vectors. The inner product is 
symmetric, i.e., we have $a^T b = b^T a$. The outer product of the two vectors is generally 
not symmetric; that is, we generally have $ab^T \neq ba^T$. What are the conditions on a and b 
under which $ab^T = ba^T$? You can assume that all the entries of a and b are nonzero. (The 
conclusion you come to will hold even when some entries of a or b are zero.) Hint. Show 
that $ab^T = ba^T$ implies that $a_i/b_i$ is a constant (i.e., independent of i). 


10.6 Product of rotation matrices. Let A be the 2 × 2 matrix that corresponds to rotation by 
θ radians, defined in (7.1), and let B be the 2 × 2 matrix that corresponds to rotation by 
ω radians. Show that AB is also a rotation matrix, and give the angle by which it rotates 
vectors. Verify that AB = BA in this case, and give a simple English explanation. 


10.7 Two rotations. Two 3-vectors x and y are related as follows. First, the vector x is rotated 
40° around the $e_3$ axis, counterclockwise (from $e_1$ toward $e_2$), to obtain the 3-vector z. 
Then, z is rotated 20° around the $e_1$ axis, counterclockwise (from $e_2$ toward $e_3$), to form y. 
Find the 3 × 3 matrix A for which y = Ax. Verify that A is an orthogonal matrix. Hint. 
Express A as a product of two matrices, which carry out the two rotations described 
above. 


10.8 Entries of matrix triple product. (See page 182.) Suppose A has dimensions m × n, B has 
dimensions n × p, C has dimensions p × q, and let D = ABC. Show that 

$$
D_{ij} = \sum_{k=1}^{n} \sum_{l=1}^{p} A_{ik} B_{kl} C_{lj}.
$$

This is the formula analogous to (10.1) for the product of two matrices. 


10.9 Multiplication by a diagonal matrix. Suppose that A is an m × n matrix, D is a diagonal 
matrix, and B = DA. Describe B in terms of A and the entries of D. You can refer to 
the rows or columns or entries of A. 


10.10 Converting from purchase quantity matrix to purchase dollar matrix. An n × N matrix 
Q gives the purchase history of a set of n products by N customers, over some period, 
with $Q_{ij}$ being the quantity of product i bought by customer j. The n-vector p gives 
the product prices. A data analyst needs the n × N matrix D, where $D_{ij}$ is the total 
dollar value that customer j spent on product i. Express D in terms of Q and p, using 
compact matrix/vector notation. You can use any notation or ideas we have encountered, 
e.g., stacking, slicing, block matrices, transpose, matrix-vector product, matrix-matrix 
product, inner product, norm, correlation, diag(), and so on. 


10.11 Trace of matrix-matrix product. The sum of the diagonal entries of a square matrix is 
called the trace of the matrix, denoted tr(A). 

(a) Suppose A and B are m × n matrices. Show that 

$$
\mathbf{tr}(A^T B) = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ij}.
$$

What is the complexity of calculating $\mathbf{tr}(A^T B)$? 

(b) The number $\mathbf{tr}(A^T B)$ is sometimes referred to as the inner product of the matrices 
A and B. (This allows us to extend concepts like angle to matrices.) Show that 
$\mathbf{tr}(A^T B) = \mathbf{tr}(B^T A)$. 

(c) Show that $\mathbf{tr}(A^T A) = \|A\|^2$. In other words, the square of the norm of a matrix is 
the trace of its Gram matrix. 

(d) Show that $\mathbf{tr}(A^T B) = \mathbf{tr}(BA^T)$, even though in general $A^T B$ and $BA^T$ can have 
different dimensions, and even when they have the same dimensions, they need not 
be equal. 


10.12 Norm of matrix product. Suppose A is an m × p matrix and B is a p × n matrix. Show 
that $\|AB\| \leq \|A\| \, \|B\|$, i.e., the (matrix) norm of the matrix product is no more than 
the product of the norms of the matrices. Hint. Let $a_1^T, \ldots, a_m^T$ be the rows of A, and 
$b_1, \ldots, b_n$ be the columns of B. Then 

$$
\|AB\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} (a_i^T b_j)^2.
$$

Now use the Cauchy–Schwarz inequality. 


10.13 Laplacian matrix of a graph. Let A be the incidence matrix of a directed graph with n 
nodes and m edges (see §7.3). The Laplacian matrix associated with the graph is defined 
as $L = AA^T$, which is the Gram matrix of $A^T$. It is named after the mathematician 
Pierre-Simon Laplace. 

(a) Show that $\mathcal{D}(v) = v^T L v$, where $\mathcal{D}(v)$ is the Dirichlet energy defined on page 135. 


(b) Describe the entries of L. Hint. The following two quantities might be useful: The 
degree of a node, which is the number of edges that connect to the node (in either 
direction), and the number of edges that connect a pair of distinct nodes (in either 
direction). 


10.14 Gram matrix. Let $a_1, \ldots, a_n$ be the columns of the m × n matrix A. Suppose that the 
columns all have norm one, and for i ≠ j, $\angle(a_i, a_j) = 60°$. What can you say about the 
Gram matrix $G = A^T A$? Be as specific as you can be. 

10.15 Pairwise distances from Gram matrix. Let A be an m × n matrix with columns $a_1, \ldots, a_n$, 
and associated Gram matrix $G = A^T A$. Express $\|a_i - a_j\|$ in terms of G, specifically $G_{ii}$, 
$G_{ij}$, and $G_{jj}$. 

10.16 Covariance matrix. Consider a list of k n-vectors $a_1, \ldots, a_k$, and define the n × k matrix 
$A = [a_1 \; \cdots \; a_k]$. 

(a) Let the k-vector μ give the means of the columns, i.e., $\mu_i = \mathbf{avg}(a_i)$, i = 1, ..., k. 
(The symbol μ is a traditional one to denote an average value.) Give an expression 
for μ in terms of the matrix A. 

(b) Let $\tilde a_1, \ldots, \tilde a_k$ be the de-meaned versions of $a_1, \ldots, a_k$, and define $\tilde A$ as the n × k 
matrix $\tilde A = [\tilde a_1 \; \cdots \; \tilde a_k]$. Give a matrix expression for $\tilde A$ in terms of A and μ. 

(c) The covariance matrix of the vectors $a_1, \ldots, a_k$ is the k × k matrix $\Sigma = (1/n)\tilde A^T \tilde A$, 
the Gram matrix of $\tilde A$ multiplied with 1/n. Show that 

$$
\Sigma_{ij} = \begin{cases} \mathbf{std}(a_i)^2 & i = j \\ \mathbf{std}(a_i)\,\mathbf{std}(a_j)\,\rho_{ij} & i \neq j \end{cases}
$$

where $\rho_{ij}$ is the correlation coefficient of $a_i$ and $a_j$. (The expression for i ≠ j assumes 
that $\rho_{ij}$ is defined, i.e., $\mathbf{std}(a_i)$ and $\mathbf{std}(a_j)$ are nonzero. If not, we interpret the 
formula as $\Sigma_{ij} = 0$.) Thus the covariance matrix encodes the standard deviations 
of the vectors, as well as correlations between all pairs. The correlation matrix is 
widely used in probability and statistics. 

(d) Let $z_1, \ldots, z_k$ be the standardized versions of $a_1, \ldots, a_k$. (We assume the de-meaned 
vectors are nonzero.) Derive a matrix expression for $Z = [z_1 \; \cdots \; z_k]$, the 
matrix of standardized vectors. Your expression should use A, μ, and the numbers 
$\mathbf{std}(a_1), \ldots, \mathbf{std}(a_k)$. 


10.17 Patients and symptoms. Each of a set of N patients can exhibit any number of a set of n 
symptoms. We express this as an N × n matrix S, with 

$$
S_{ij} = \begin{cases} 1 & \text{patient } i \text{ exhibits symptom } j \\ 0 & \text{patient } i \text{ does not exhibit symptom } j. \end{cases}
$$

Give simple English descriptions of the following expressions. Include the dimensions, 
and describe the entries. 

(a) $S\mathbf{1}$. 
(b) $S^T\mathbf{1}$. 
(c) $S^T S$. 
(d) $SS^T$. 

10.18 Students, classes, and majors. We consider m students, n classes, and p majors. Each 
student can be in any number of the classes (although we’d expect the number to range 
from 3 to 6), and can have any number of the majors (although the common values would 
be 0, 1, or 2). The data about the students’ classes and majors are given by an m × n 
matrix C and an m × p matrix M, where 

$$
C_{ij} = \begin{cases} 1 & \text{student } i \text{ is in class } j \\ 0 & \text{student } i \text{ is not in class } j, \end{cases}
$$

and 

$$
M_{ij} = \begin{cases} 1 & \text{student } i \text{ is in major } j \\ 0 & \text{student } i \text{ is not in major } j. \end{cases}
$$

(a) Let E be the n-vector with $E_i$ being the enrollment in class i. Express E using 
matrix notation, in terms of the matrices C and M. 

(b) Define the n × p matrix S where $S_{ij}$ is the total number of students in class i with 
major j. Express S using matrix notation, in terms of the matrices C and M. 

10.19 Student group membership. Let $G \in \mathbf{R}^{m \times n}$ represent a contingency matrix of m students 
who are members of n groups: 

$$
G_{ij} = \begin{cases} 1 & \text{student } i \text{ is in group } j \\ 0 & \text{student } i \text{ is not in group } j. \end{cases}
$$

(A student can be in any number of the groups.) 

(a) What is the meaning of the 3rd column of G? 
(b) What is the meaning of the 15th row of G? 
(c) Give a simple formula (using matrices, vectors, etc.) for the n-vector M, where $M_i$ 
is the total membership (i.e., number of students) in group i. 
(d) Interpret $(GG^T)_{ij}$ in simple English. 
(e) Interpret $(G^T G)_{ij}$ in simple English. 

10.20 Products, materials, and locations. P different products each require some amounts of M 
different materials, and are manufactured in L different locations, which have different 
material costs. We let $C_{lm}$ denote the cost of material m in location l, for l = 1, ..., L and 
m = 1, ..., M. We let $Q_{mp}$ denote the amount of material m required to manufacture 
one unit of product p, for m = 1, ..., M and p = 1, ..., P. Let $T_{pl}$ denote the total cost to 
manufacture product p in location l, for p = 1, ..., P and l = 1, ..., L. Give an expression 
for the matrix T. 


10.21 Integral of product of polynomials. Let p and q be two quadratic polynomials, given by 

$$
p(x) = c_1 + c_2 x + c_3 x^2, \qquad q(x) = d_1 + d_2 x + d_3 x^2.
$$

Express the integral $J = \int_0^1 p(x) q(x) \, dx$ in the form $J = c^T G d$, where G is a 3 × 3 matrix. 
Give the entries of G (as numbers). 


10.22 Composition of linear dynamical systems. We consider two time-invariant linear 
dynamical systems with outputs. The first one is given by 

$$
x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t, \qquad t = 1, 2, \ldots,
$$

with state $x_t$, input $u_t$, and output $y_t$. The second is given by 

$$
\tilde x_{t+1} = \tilde A \tilde x_t + \tilde B w_t, \qquad v_t = \tilde C \tilde x_t, \qquad t = 1, 2, \ldots,
$$

with state $\tilde x_t$, input $w_t$, and output $v_t$. We now connect the output of the first linear 
dynamical system to the input of the second one, which means we take $w_t = y_t$. (This 
is called the composition of the two systems.) Show that this composition can also be 
expressed as a linear dynamical system with state $z_t = (x_t, \tilde x_t)$, input $u_t$, and output $v_t$. 
(Give the state transition matrix, input matrix, and output matrix.) 


10.23 Suppose A is an n × n matrix that satisfies $A^2 = 0$. Does this imply that A = 0? (This 
is the case when n = 1.) If this is (always) true, explain why. If it is not, give a specific 
counterexample, i.e., a matrix A that is nonzero but satisfies $A^2 = 0$. 


10.24 Matrix power identity. A student says that for any square matrix A, 

$$
(A + I)^3 = A^3 + 3A^2 + 3A + I.
$$

Is she right? If she is, explain why; if she is wrong, give a specific counterexample, i.e., a 
square matrix A for which it does not hold. 


10.25 Square roots of the identity. The number 1 has two square roots (i.e., numbers whose square 
is 1), 1 and −1. The n × n identity matrix $I_n$ has many more square roots. 

(a) Find all diagonal square roots of $I_n$. How many are there? (For n = 1, you should 
get 2.) 

(b) Find a nondiagonal 2 × 2 matrix A that satisfies $A^2 = I$. This means that in general 
there are even more square roots of $I_n$ than you found in part (a). 


10.26 Circular shift matrices. Let A be the 5 × 5 matrix 

$$
A = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}.
$$

(a) How is Ax related to x? Your answer should be in English. Hint. See exercise title. 

(b) What is $A^5$? Hint. The answer should make sense, given your answer to part (a). 


10.27 Dynamics of an economy. Let $x_1, x_2, \ldots$ be n-vectors that give the level of economic 
activity of a country in years 1, 2, ..., in n different sectors (like energy, defense, 
manufacturing). Specifically, $(x_t)_i$ is the level of economic activity in economic sector i (say, 
in billions of dollars) in year t. A common model that connects these economic activity 
vectors is $x_{t+1} = B x_t$, where B is an n × n matrix. (See exercise 9.2.) 


Give a matrix expression for the total economic activity across all sectors in year t = 6, 
in terms of the matrix B and the vector of initial activity levels x1. Suppose you can 
increase economic activity in year t = 1 by some fixed amount (say, one billion dollars) 
in one sector, by government spending. How should you choose which sector to stimulate 
so as to maximize the total economic output in year t = 6? 


10.28 Controllability matrix. Consider the time-invariant linear dynamical system $x_{t+1} = A x_t + 
B u_t$, with n-vector state $x_t$ and m-vector input $u_t$. Let $U = (u_1, u_2, \ldots, u_{T-1})$ denote 
the sequence of inputs, stacked in one vector. Find the matrix $C_T$ for which 

$$
x_T = A^{T-1} x_1 + C_T U
$$

holds. The first term is what $x_T$ would be if $u_1 = \cdots = u_{T-1} = 0$; the second term 
shows how the sequence of inputs $u_1, \ldots, u_{T-1}$ affects $x_T$. The matrix $C_T$ is called the 
controllability matrix of the linear dynamical system. 


10.29 Linear dynamical system with 2× down-sampling. We consider a linear dynamical system 
with n-vector state $x_t$, m-vector input $u_t$, and dynamics given by 

$$
x_{t+1} = A x_t + B u_t, \quad t = 1, 2, \ldots,
$$

where A is an n × n matrix and B is n × m. Define $z_t = x_{2t-1}$ for t = 1, 2, ..., i.e., 

$$
z_1 = x_1, \quad z_2 = x_3, \quad z_3 = x_5, \quad \ldots.
$$

(The sequence $z_t$ is the original state sequence $x_t$ ‘down-sampled’ by 2×.) Define the 
(2m)-vectors $w_t$ as $w_t = (u_{2t-1}, u_{2t})$ for t = 1, 2, ..., i.e., 

$$
w_1 = (u_1, u_2), \quad w_2 = (u_3, u_4), \quad w_3 = (u_5, u_6), \quad \ldots.
$$

(Each entry of the sequence $w_t$ is a stack of two consecutive original inputs.) Show that 
$z_t$, $w_t$ satisfy the linear dynamics equation $z_{t+1} = F z_t + G w_t$, for t = 1, 2, .... Give the 
matrices F and G in terms of A and B. 


10.30 Cycles in a graph. A cycle of length ℓ in a directed graph is a path of length ℓ that starts 
and ends at the same vertex. Determine the total number of cycles of length ℓ = 10 for 
the directed graph given in the example on page 187. Break this number down into the 
number of cycles that begin (and end) at vertex 1, vertex 2, ..., vertex 5. (These should 
add up to the total.) Hint. Do not count the cycles by hand. 


10.31 Diameter of a graph. A directed graph with n vertices is described by its n × n adjacency 
matrix A (see §10.3). 

(a) Derive an expression $P_{ij}$ for the total number of paths, with length no more than k, 
from vertex j to vertex i. (We include in this total the number of paths of length 
zero, which go from each vertex j to itself.) Hint. You can derive an expression for 
the matrix P, in terms of the matrix A. 

(b) The diameter D of a graph is the smallest number for which there is a path of length 
≤ D from node j to node i, for every pair of vertices j and i. Using part (a), explain 
how to compute the diameter of a graph using matrix operations (such as addition, 
multiplication). 

Remark. Suppose the vertices represent all people on earth, and the graph edges represent 
acquaintance, i.e., $A_{ij} = 1$ if person j and person i are acquainted. (This graph is 
symmetric.) Even though n is measured in billions, the diameter of this acquaintance 
graph is thought to be quite small, perhaps 6 or 7. In other words, any two people 
on earth can be connected through a set of 6 or 7 (or fewer) acquaintances. This idea, 
originally conjectured in the 1920s, is sometimes called six degrees of separation. 


10.32 Matrix exponential. You may know that for any real number a, the sequence $(1 + a/k)^k$ 
converges as k → ∞ to the exponential of a, denoted exp a or $e^a$. The matrix exponential 
of a square matrix A is defined as the limit of the matrix sequence $(I + A/k)^k$ as k → ∞. 
(It can be shown that this sequence always converges.) The matrix exponential arises in 
many applications, and is covered in more advanced courses on linear algebra. 

(a) Find exp 0 (the zero matrix) and exp I. 


(b) Find exp A, for $A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$. 


10.33 Matrix equations. Consider two m × n matrices A and B. Suppose that for j = 1, ..., n, 
the jth column of A is a linear combination of the first j columns of B. How do we 
express this as a matrix equation? Choose one of the matrix equations below and justify 
your choice. 

(a) A = GB for some upper triangular matrix G. 
(b) A = BH for some upper triangular matrix H. 
(c) A = FB for some lower triangular matrix F. 
(d) A = BJ for some lower triangular matrix J. 


10.34 Choose one of the responses always, never, or sometimes for each of the statements 
below. ‘Always’ means the statement is always true, ‘never’ means it is never true, and 
‘sometimes’ means it can be true or false, depending on the particular values of the matrix 
or matrices. Give a brief justification of each answer. 


(a) An upper triangular matrix has linearly independent columns. 
(b) The rows of a tall matrix are linearly dependent. 


(c) The columns of A are linearly independent, and AB = 0 for some nonzero matrix B. 


10.35 Orthogonal matrices. Let U and V be two orthogonal n × n matrices. Show that the 
matrix UV and the (2n) × (2n) matrix 

$$
\frac{1}{\sqrt{2}} \begin{bmatrix} U & U \\ V & -V \end{bmatrix}
$$

are orthogonal. 

10.36 Quadratic form. Suppose A is an n × n matrix and x is an n-vector. The triple product 
$x^T A x$, a 1 × 1 matrix which we consider to be a scalar (i.e., number), is called a quadratic 
form of the vector x, with coefficient matrix A. A quadratic form is the vector analog 
of a quadratic function $\alpha u^2$, where α and u are both numbers. Quadratic forms arise in 
many fields and applications. 

(a) Show that $x^T A x = \sum_{i,j=1}^{n} A_{ij} x_i x_j$. 

(b) Show that $x^T (A^T) x = x^T A x$. In other words, the quadratic form with the 
transposed coefficient matrix has the same value for any x. Hint. Take the transpose of 
the triple product $x^T A x$. 

(c) Show that $x^T ((A + A^T)/2) x = x^T A x$. In other words, the quadratic form with 
coefficient matrix equal to the symmetric part of a matrix (i.e., $(A + A^T)/2$) has 
the same value as the original quadratic form. 

(d) Express $2x_1^2 - 3x_1 x_2 - x_2^2$ as a quadratic form, with symmetric coefficient matrix A. 


10.37 Orthogonal 2 x 2 matrices. In this problem, you will show that every 2 x 2 orthogonal 
matrix is either a rotation or a reflection (see §7.1). 


(a) Let 

\[ Q = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]

be an orthogonal 2 × 2 matrix. Show that the following equations hold: 

\[ a^2 + c^2 = 1, \qquad b^2 + d^2 = 1, \qquad ab + cd = 0. \]


(b) Define s = ad − bc. Combine the three equalities in part (a) to show that 

\[ |s| = 1, \qquad b = -sc, \qquad d = sa. \]


(c) Suppose $a = \cos\theta$. Show that there are two possible matrices Q: a rotation 
(counterclockwise over $\theta$ radians), and a reflection (through the line that passes through 
the origin at an angle of $\theta/2$ radians with respect to horizontal). 


10.38 Orthogonal matrix with nonnegative entries. Suppose the n x n matrix A is orthogonal, 
and all of its entries are nonnegative, i.e., $A_{ij} \geq 0$ for i, j = 1, ..., n. Show that A must 
be a permutation matrix, i.e., each entry is either 0 or 1, each row has exactly one entry 
with value one, and each column has exactly one entry with value one. (See page 132.) 


10.39 Gram matrix and QR factorization. Suppose the matrix A has linearly independent 
columns and QR factorization A = QR. What is the relationship between the Gram 
matrix of A and the Gram matrix of R? What can you say about the angles between the 
columns of A and the angles between the columns of R? 


10.40 QR factorization of first i columns of A. Suppose the n × k matrix A has QR factorization 
A = QR. We define the n × i matrices 

\[ A_i = \begin{bmatrix} a_1 & \cdots & a_i \end{bmatrix}, \qquad Q_i = \begin{bmatrix} q_1 & \cdots & q_i \end{bmatrix} \]

for i = 1, ..., k. Define the i × i matrix $R_i$ as the submatrix of R containing its first i 
rows and columns, for i = 1, ..., k. Using index range notation, we have 

\[ A_i = A_{1:n,1:i}, \qquad Q_i = Q_{1:n,1:i}, \qquad R_i = R_{1:i,1:i}. \]

Show that $A_i = Q_i R_i$ is the QR factorization of $A_i$. This means that when you compute 
the QR factorization of A, you are also computing the QR factorization of all submatrices 
$A_1, \ldots, A_k$. 
10.41 Clustering via k-means as an approximate matrix factorization. Suppose we run the 
k-means algorithm on the N n-vectors $x_1, \ldots, x_N$, to obtain the group representatives 
$z_1, \ldots, z_k$. Define the matrices 

\[ X = \begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}, \qquad Z = \begin{bmatrix} z_1 & \cdots & z_k \end{bmatrix}. \]

X has size n × N and Z has size n × k. We encode the assignment of vectors to groups 
by the k × N clustering matrix C, with $C_{ij} = 1$ if $x_j$ is assigned to group i, and $C_{ij} = 0$ 
otherwise. Each column of C is a unit vector; its transpose is a selector matrix. 


(a) Give an interpretation of the columns of the matrix X − ZC, and the squared 
(matrix) norm $\|X - ZC\|^2$. 


(b) Justify the following statement: The goal of the k-means algorithm is to find an 
n × k matrix Z, and a k × N matrix C, which is the transpose of a selector matrix, 
so that $\|X - ZC\|$ is small, i.e., $X \approx ZC$. 


10.42 A matrix-vector multiplication Ax of an n × n matrix A and an n-vector x takes $2n^2$ flops 
in general. Formulate a faster method, with complexity linear in n, for matrix-vector 
multiplication with the matrix $A = I + ab^T$, where a and b are given n-vectors. 


10.43 A particular computer takes about 0.2 seconds to multiply two 1500 × 1500 matrices. 
About how long would you guess the computer takes to multiply two 3000 × 3000 matrices? 
Give your prediction (i.e., the time in seconds), and your (very brief) reasoning. 


10.44 Complexity of matrix quadruple product. (See page 182.) We wish to compute the product 
E = ABCD, where A is m × n, B is n × p, C is p × q, and D is q × r. 


(a) Find all methods for computing E using three matrix-matrix multiplications. For 
example, you can compute AB, CD, and then the product (AB)(CD). Give the 
total number of flops required for each of these methods. Hint. There are four other 
methods. 


(b) Which method requires the fewest flops, with dimensions m = 10, n = 1000, p = 10, 
q = 1000, r = 100? 


Chapter 11 


Matrix inverses 


In this chapter we introduce the concept of matrix inverse. We show how matrix 
inverses can be used to solve linear equations, and how they can be computed using 
the QR factorization. 


11.1 Left and right inverses 


Recall that for a number a, its (multiplicative) inverse is the number x for which 
xa = 1, which we usually denote as x = 1/a or (less frequently) $x = a^{-1}$. The 
inverse x exists provided a is nonzero. For matrices the concept of inverse is more 
complicated than for scalars; in the general case, we need to distinguish between 
left and right inverses. We start with the left inverse. 


Left inverse. A matrix X that satisfies 

\[ XA = I \]

is called a left inverse of A. The matrix A is said to be left-invertible if a left 
inverse exists. Note that if A has size m × n, a left inverse X will have size n × m, 
the same dimensions as $A^T$. 


Examples. 


• If A is a number (i.e., a 1 × 1 matrix), then a left inverse X is the same as 
the inverse of the number. In this case, A is left-invertible whenever A is 
nonzero, and it has only one left inverse. 


• Any nonzero n-vector a, considered as an n × 1 matrix, is left-invertible. For 
any index i with $a_i \neq 0$, the row n-vector $x = (1/a_i) e_i^T$ satisfies xa = 1. 


• The matrix 

\[ A = \begin{bmatrix} -3 & -4 \\ 4 & 6 \\ 1 & 1 \end{bmatrix} \]

has two different left inverses: 

\[ B = \frac{1}{9} \begin{bmatrix} -11 & -10 & 16 \\ 7 & 8 & -11 \end{bmatrix}, \qquad C = \frac{1}{2} \begin{bmatrix} 0 & -1 & 6 \\ 0 & 1 & -4 \end{bmatrix}. \]

This can be verified by checking that BA = CA = I. The example illustrates 
that a left-invertible matrix can have more than one left inverse. (In fact, if it 
has more than one left inverse, then it has infinitely many; see exercise 11.1.) 


• A matrix A with orthonormal columns satisfies $A^T A = I$, so it is 
left-invertible; its transpose $A^T$ is a left inverse. 


Left-invertibility and column independence. If A has a left inverse C then the 
columns of A are linearly independent. To see this, suppose that Ax = 0. Multi- 
plying on the left by a left inverse C, we get 


\[ 0 = C(Ax) = (CA)x = Ix = x, \]


which shows that the only linear combination of the columns of A that is 0 is the 
one with all coefficients zero. 

We will see below that the converse is also true; a matrix has a left inverse if 
and only if its columns are linearly independent. So the generalization of ‘a number 
has an inverse if and only if it is nonzero’ is ‘a matrix has a left inverse if and only 
if its columns are linearly independent’. 


Dimensions of left inverses. Suppose the m × n matrix A is wide, i.e., m < n. 
By the independence-dimension inequality, its columns are linearly dependent, and 
therefore it is not left-invertible. Only square or tall matrices can be left-invertible. 


Solving linear equations with a left inverse. Suppose that Ax = b, where A is 
an m × n matrix and x is an n-vector. If C is a left inverse of A, we have 

\[ Cb = C(Ax) = (CA)x = Ix = x, \]


which means that x = Cb is a solution of the set of linear equations. The columns 
of A are linearly independent (since it has a left inverse), so there is only one 
solution of the linear equations Ax = b; in other words, x = Cb is the solution of 
Ax = b. 

Now suppose there is no x that satisfies the linear equations Ax = b, and 
let C be a left inverse of A. Then x = Cb does not satisfy Ax = b, since no 
vector satisfies this equation by assumption. This gives a way to check if the linear 
equations Ax = b have a solution, and to find one when there is one, provided we 
have a left inverse of A. We simply test whether A(Cb) = b. If this holds, then we 
have found a solution of the linear equations; if it does not, then we can conclude 
that there is no solution of Ax = b. 

In summary, a left inverse can be used to determine whether or not a solution of 
an over-determined set of linear equations exists, and when it does, find the unique 
solution. 
Right inverse. Now we turn to the closely related concept of right inverse. A 
matrix X that satisfies 

\[ AX = I \]

is called a right inverse of A. The matrix A is right-invertible if a right inverse 
exists. Any right inverse has the same dimensions as $A^T$. 


Left and right inverse of matrix transpose. If A has a right inverse B, then 
$B^T$ is a left inverse of $A^T$, since $B^T A^T = (AB)^T = I$. If A has a left inverse C, 
then $C^T$ is a right inverse of $A^T$, since $A^T C^T = (CA)^T = I$. This observation 
allows us to map all the results for left-invertibility given above to similar results 
for right-invertibility. Some examples are given below. 


• A matrix is right-invertible if and only if its rows are linearly independent. 


• A tall matrix cannot have a right inverse. Only square or wide matrices can 
be right-invertible. 


Solving linear equations with a right inverse. Consider the set of m linear equa- 
tions in n variables Ax = b. Suppose A is right-invertible, with right inverse B. 
This implies that A is square or wide, so the linear equations Ax = b are square or 
under-determined. 

Then for any m-vector b, the n-vector x = Bb satisfies the equation Ax = b. 
To see this, we note that 


Ax = A(Bb) = (AB)b = Ib = b. 


We can conclude that if A is right-invertible, then the linear equations Ax = b can 
be solved for any vector b. Indeed, x = Bb is a solution. (There can be other 
solutions of Ax = b; the solution x = Bb is simply one of them.) 

In summary, a right inverse can be used to find a solution of a square or under- 
determined set of linear equations, for any vector b. 


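Examples. Consider the matrix appearing in the example above on page 199, 

\[ A = \begin{bmatrix} -3 & -4 \\ 4 & 6 \\ 1 & 1 \end{bmatrix}, \]

and the two left inverses 

\[ B = \frac{1}{9} \begin{bmatrix} -11 & -10 & 16 \\ 7 & 8 & -11 \end{bmatrix}, \qquad C = \frac{1}{2} \begin{bmatrix} 0 & -1 & 6 \\ 0 & 1 & -4 \end{bmatrix}. \]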


• The over-determined linear equations Ax = (1, −2, 0) have the unique 
solution x = (1, −1), which can be obtained from either left inverse: 

\[ x = B(1, -2, 0) = C(1, -2, 0). \]


• The over-determined linear equations Ax = (1, −1, 0) do not have a solution, 
since x = C(1, −1, 0) = (1/2, −1/2) does not satisfy Ax = (1, −1, 0). 


• The under-determined linear equations $A^T y = (1, 2)$ have (different) solutions 

\[ B^T (1, 2) = (1/3, 2/3, -2/3), \qquad C^T (1, 2) = (0, 1/2, -1). \]

(Recall that $B^T$ and $C^T$ are both right inverses of $A^T$.) We can find a solution 
of $A^T y = b$ for any vector b. 
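The examples above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `solve_with_left_inverse` and the tolerance are illustrative choices, not notation from the text. It applies the test described earlier: form x = Cb and accept it only if A(Cb) = b.

```python
import numpy as np

# The 3 x 2 matrix A and its two left inverses B and C from the example.
A = np.array([[-3.0, -4.0],
              [4.0, 6.0],
              [1.0, 1.0]])
B = np.array([[-11.0, -10.0, 16.0],
              [7.0, 8.0, -11.0]]) / 9
C = np.array([[0.0, -1.0, 6.0],
              [0.0, 1.0, -4.0]]) / 2


def solve_with_left_inverse(A, X, b, tol=1e-10):
    """Return x = X b if it satisfies A x = b (X a left inverse of A), else None."""
    x = X @ b
    return x if np.allclose(A @ x, b, atol=tol) else None


# Over-determined equations: one right-hand side is solvable, the other is not.
x1 = solve_with_left_inverse(A, C, np.array([1.0, -2.0, 0.0]))
x2 = solve_with_left_inverse(A, C, np.array([1.0, -1.0, 0.0]))

# Under-determined equations A^T y = (1, 2): B^T and C^T are right inverses of A^T.
y_B = B.T @ np.array([1.0, 2.0])
y_C = C.T @ np.array([1.0, 2.0])
```

Here `x1` is the unique solution, `x2` is `None` because no solution exists, and `y_B` and `y_C` are two different solutions of the same under-determined system.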


Left and right inverse of matrix product. Suppose A and D are compatible for 
the matrix product AD (i.e., the number of columns in A is equal to the number 
of rows in D). If A has a right inverse B and D has a right inverse E, then EB is 
a right inverse of AD. This follows from 

\[ (AD)(EB) = A(DE)B = A(IB) = AB = I. \]


If A has a left inverse C and D has a left inverse F, then FC is a left inverse 
of AD. This follows from 


\[ (FC)(AD) = F(CA)D = FD = I. \]


Inverse 


If a matrix is left- and right-invertible, then the left and right inverses are unique 
and equal. To see this, suppose that AX = I and YA = I, i.e., X is any right 
inverse and Y is any left inverse of A. Then we have 


X =(YA)X =Y(AX) =Y, 


i.e., any left inverse of A is equal to any right inverse of A. This implies that the 
right inverse is unique: If we have $A\tilde{X} = I$ for some matrix $\tilde{X}$, then the argument above 
tells us that $\tilde{X} = Y$, so we have $\tilde{X} = X$, i.e., there is only one right inverse of A. A similar 
argument shows that Y (which is the same as X) is the only left inverse of A. 
When a matrix A has both a left inverse Y and a right inverse X, we call the 
matrix X = Y simply the inverse of A, and denote it as $A^{-1}$. We say that A is 
invertible or nonsingular. A square matrix that is not invertible is called singular. 


Dimensions of invertible matrices. Invertible matrices must be square, since tall 
matrices are not right-invertible, while wide matrices are not left-invertible. A 
matrix A and its inverse (if it exists) satisfy 

\[ AA^{-1} = A^{-1}A = I. \]

If A has inverse $A^{-1}$, then the inverse of $A^{-1}$ is A; in other words, we have 
$(A^{-1})^{-1} = A$. For this reason we say that A and $A^{-1}$ are inverses (of each other). 
Solving linear equations with the inverse. Consider the square system of n linear 
equations with n variables, Ax = b. If A is invertible, then for any n-vector b, 

\[ x = A^{-1}b \tag{11.1} \]

is a solution of the equations. (This follows since $A^{-1}$ is a right inverse of A.) 
Moreover, it is the only solution of Ax = b. (This follows since $A^{-1}$ is a left inverse 
of A.) We summarize this very important result as 


The square system of linear equations Ax = b, with A invertible, has 
the unique solution $x = A^{-1}b$, for any n-vector b. 


One immediate conclusion we can draw from the formula (11.1) is that the 
solution of a square set of linear equations is a linear function of the right-hand 
side vector b. 


Invertibility conditions. For square matrices, left-invertibility, right-invertibility, 
and invertibility are equivalent: If a matrix is square and left-invertible, then it is 
also right-invertible (and therefore invertible) and vice-versa. 

To see this, suppose A is an n × n matrix and left-invertible. This implies that 
the n columns of A are linearly independent. Therefore they form a basis and so 
any n-vector can be expressed as a linear combination of the columns of A. In 
particular, each of the n unit vectors $e_i$ can be expressed as $e_i = Ab_i$ for some 
n-vector $b_i$. The matrix $B = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}$ satisfies 

\[ AB = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_n \end{bmatrix} = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix} = I. \]

So B is a right inverse of A. 
We have just shown that for a square matrix A, 

\[ \text{left-invertibility} \implies \text{column independence} \implies \text{right-invertibility}. \]

(The symbol $\implies$ means that the left-hand condition implies the right-hand 
condition.) Applying the same result to the transpose of A allows us to also conclude 
that 

\[ \text{right-invertibility} \implies \text{row independence} \implies \text{left-invertibility}. \]


So all six of these conditions are equivalent; if any one of them holds, so do the 
other five. 
In summary, for a square matrix A, the following are equivalent. 


• A is invertible. 

• The columns of A are linearly independent. 

• The rows of A are linearly independent. 

• A has a left inverse. 

• A has a right inverse. 
Examples. 


• The identity matrix I is invertible, with inverse $I^{-1} = I$, since II = I. 


• A diagonal matrix A is invertible if and only if its diagonal entries are nonzero. 
The inverse of an n × n diagonal matrix A with nonzero diagonal entries is 

\[ A^{-1} = \begin{bmatrix} 1/A_{11} & 0 & \cdots & 0 \\ 0 & 1/A_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/A_{nn} \end{bmatrix}, \]

since 

\[ AA^{-1} = \begin{bmatrix} A_{11}/A_{11} & 0 & \cdots & 0 \\ 0 & A_{22}/A_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{nn}/A_{nn} \end{bmatrix} = I. \]

In compact notation, we have 

\[ \mathbf{diag}(A_{11}, \ldots, A_{nn})^{-1} = \mathbf{diag}(A_{11}^{-1}, \ldots, A_{nn}^{-1}). \]

Note that the inverse on the left-hand side of this equation is the matrix 
inverse, while the inverses appearing on the right-hand side are scalar inverses. 


• As a non-obvious example, the matrix 

\[ A = \begin{bmatrix} 1 & -2 & 3 \\ 0 & 2 & 2 \\ -3 & -4 & -4 \end{bmatrix} \]

is invertible, with inverse 

\[ A^{-1} = \frac{1}{30} \begin{bmatrix} 0 & -20 & -10 \\ -6 & 5 & -2 \\ 6 & 10 & 2 \end{bmatrix}. \]

This can be verified by checking that $AA^{-1} = I$ (or that $A^{-1}A = I$, since 
either of these implies the other). 


• 2 × 2 matrices. A 2 × 2 matrix A is invertible if and only if $A_{11}A_{22} \neq A_{12}A_{21}$, 
with inverse 

\[ A^{-1} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \frac{1}{A_{11}A_{22} - A_{12}A_{21}} \begin{bmatrix} A_{22} & -A_{12} \\ -A_{21} & A_{11} \end{bmatrix}. \]

(There are similar formulas for the inverse of a matrix of any size, but they 
grow very quickly in complexity and so are not very useful in most applications.) 


• Orthogonal matrix. If A is square with orthonormal columns, we have $A^T A = I$, 
so A is invertible with inverse $A^{-1} = A^T$. 
Inverse of matrix transpose. If A is invertible, its transpose $A^T$ is also invertible 
and its inverse is $(A^{-1})^T$: 

\[ (A^T)^{-1} = (A^{-1})^T. \]

Since the order of the transpose and inverse operations does not matter, this matrix 
is sometimes written as $A^{-T}$. 


Inverse of matrix product. If A and B are invertible (hence, square) and of the 
same size, then AB is invertible, and 


\[ (AB)^{-1} = B^{-1}A^{-1}. \tag{11.2} \]


The inverse of a product is the product of the inverses, in reverse order. 


Dual basis. Suppose that A is invertible with inverse $B = A^{-1}$. Let $a_1, \ldots, a_n$ be 
the columns of A, and $b_1^T, \ldots, b_n^T$ denote the rows of B, i.e., the columns of $B^T$: 

\[ A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \qquad B = \begin{bmatrix} b_1^T \\ \vdots \\ b_n^T \end{bmatrix}. \]

We know that $a_1, \ldots, a_n$ form a basis, since the columns of A are linearly 
independent. The vectors $b_1, \ldots, b_n$ also form a basis, since the rows of B are linearly 
independent. They are called the dual basis of $a_1, \ldots, a_n$. (The dual basis of 
$b_1, \ldots, b_n$ is $a_1, \ldots, a_n$, so they are called dual bases.) 
Now suppose that x is any n-vector. It can be expressed as a linear combination 
of the basis vectors $a_1, \ldots, a_n$: 

\[ x = \beta_1 a_1 + \cdots + \beta_n a_n. \]

The dual basis gives us a simple way to find the coefficients $\beta_1, \ldots, \beta_n$. 
We start with AB = I, and multiply by x to get 

\[ x = ABx = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1^T \\ \vdots \\ b_n^T \end{bmatrix} x = (b_1^T x) a_1 + \cdots + (b_n^T x) a_n. \]

This means (since the vectors $a_1, \ldots, a_n$ are linearly independent) that $\beta_i = b_i^T x$. 
In words: The coefficients in the expansion of a vector in a basis are given by the 
inner products with the dual basis vectors. Using matrix notation, we can say that 
$\beta = Bx = A^{-1}x$ is the vector of coefficients of x in the basis given by the 
columns of A. 

As a simple numerical example, consider the basis 

\[ a_1 = (1, 1), \qquad a_2 = (1, -1). \]

The dual basis consists of the rows of $\begin{bmatrix} a_1 & a_2 \end{bmatrix}^{-1}$, which are 

\[ b_1^T = \begin{bmatrix} 1/2 & 1/2 \end{bmatrix}, \qquad b_2^T = \begin{bmatrix} 1/2 & -1/2 \end{bmatrix}. \]

To express the vector x = (−5, 1) as a linear combination of $a_1$ and $a_2$, we have 

\[ x = (b_1^T x) a_1 + (b_2^T x) a_2 = (-2) a_1 + (-3) a_2, \]

which can be directly verified. 
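The numerical example can be reproduced in a few lines. This is a small sketch, assuming NumPy; `np.linalg.inv` is used here only for illustration, and the variable names are not from the text.

```python
import numpy as np

# Basis vectors a_1 = (1, 1) and a_2 = (1, -1) as the columns of A.
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])

# The rows of B = A^{-1} are the dual basis vectors b_1^T and b_2^T.
B = np.linalg.inv(A)

x = np.array([-5.0, 1.0])

# Coefficients of x in the basis a_1, a_2: beta_i = b_i^T x.
beta = B @ x

# Rebuild x from the expansion beta_1 a_1 + beta_2 a_2.
reconstructed = A @ beta
```

The vector `beta` holds the inner products of x with the dual basis vectors, and `reconstructed` should agree with x.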
Negative matrix powers. We can now give a meaning to matrix powers with 
negative integer exponents. Suppose A is a square invertible matrix and k is a 
positive integer. Then by repeatedly applying property (11.2), we get 

\[ (A^k)^{-1} = (A^{-1})^k. \]

We denote this matrix as $A^{-k}$. For example, if A is square and invertible, then 
$A^{-2} = A^{-1}A^{-1} = (AA)^{-1}$. With $A^0$ defined as $A^0 = I$, the identity $A^{k+l} = A^k A^l$ 
holds for all integers k and l. 


Triangular matrix. A triangular matrix with nonzero diagonal elements is invert- 
ible. We first discuss this for a lower triangular matrix. Let L be n x n and lower 
triangular with nonzero diagonal elements. We show that the columns are linearly 
independent, i.e., Lx = 0 is only possible if x = 0. Expanding the matrix-vector 
product, we can write Lx = 0 as 

\[ \begin{aligned} L_{11}x_1 &= 0 \\ L_{21}x_1 + L_{22}x_2 &= 0 \\ L_{31}x_1 + L_{32}x_2 + L_{33}x_3 &= 0 \\ &\;\;\vdots \\ L_{n1}x_1 + L_{n2}x_2 + \cdots + L_{n,n-1}x_{n-1} + L_{nn}x_n &= 0. \end{aligned} \]

Since $L_{11} \neq 0$, the first equation implies $x_1 = 0$. Using $x_1 = 0$, the second equation 
reduces to $L_{22}x_2 = 0$. Since $L_{22} \neq 0$, we conclude that $x_2 = 0$. Using $x_1 = x_2 = 0$, 
the third equation now reduces to $L_{33}x_3 = 0$, and since $L_{33}$ is assumed to be 
nonzero, we have $x_3 = 0$. Continuing this argument, we find that all entries of x 
are zero, and this shows that the columns of L are linearly independent. It follows 
that L is invertible. 

A similar argument can be followed to show that an upper triangular matrix 
with nonzero diagonal elements is invertible. One can also simply note that if R 
is upper triangular, then L = RT is lower triangular with the same diagonal, and 
use the formula $(L^T)^{-1} = (L^{-1})^T$ for the inverse of the transpose. 


Inverse via QR factorization. The QR factorization gives a simple expression 
for the inverse of an invertible matrix. If A is square and invertible, its columns 
are linearly independent, so it has a QR factorization A = QR. The matrix Q is 
orthogonal and R is upper triangular with positive diagonal entries. Hence Q and 
R are invertible, and the formula for the inverse of a product gives 

\[ A^{-1} = (QR)^{-1} = R^{-1}Q^{-1} = R^{-1}Q^T. \tag{11.3} \]

In the following section we give an algorithm for computing $R^{-1}$, or more 
directly, the product $R^{-1}Q^T$. This gives us a method to compute the matrix 
inverse. 
Solving linear equations 


Back substitution. We start with an algorithm for solving a set of linear 
equations, Rx = b, where the n × n matrix R is upper triangular with nonzero diagonal 
entries (hence, invertible). We write out the equations as 

\[ \begin{aligned} R_{11}x_1 + R_{12}x_2 + \cdots + R_{1,n-1}x_{n-1} + R_{1n}x_n &= b_1 \\ &\;\;\vdots \\ R_{n-2,n-2}x_{n-2} + R_{n-2,n-1}x_{n-1} + R_{n-2,n}x_n &= b_{n-2} \\ R_{n-1,n-1}x_{n-1} + R_{n-1,n}x_n &= b_{n-1} \\ R_{nn}x_n &= b_n. \end{aligned} \]

From the last equation, we find that $x_n = b_n/R_{nn}$. Now that we know $x_n$, we 
substitute it into the second to last equation, which gives us 

\[ x_{n-1} = (b_{n-1} - R_{n-1,n}x_n)/R_{n-1,n-1}. \]

We can continue this way to find $x_{n-2}, x_{n-3}, \ldots, x_1$. This algorithm is known as 
back substitution, since the variables are found one at a time, starting from $x_n$, and 
we substitute the ones that are known into the remaining equations. 


Algorithm 11.1 BACK SUBSTITUTION 


given an n × n upper triangular matrix R with nonzero diagonal entries, and an 
n-vector b. 

For i = n, ..., 1, 
    $x_i = (b_i - R_{i,i+1}x_{i+1} - \cdots - R_{i,n}x_n)/R_{ii}$. 


(In the first step, with i = n, we have $x_n = b_n/R_{nn}$.) The back substitution 
algorithm computes the solution of Rx = b, i.e., $x = R^{-1}b$. It cannot fail since the 
divisions in each step are by the diagonal entries of R, which are assumed to be 
nonzero. 
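Algorithm 11.1 translates almost line for line into code. The following is a minimal sketch, assuming NumPy; the function name `back_substitution` and the small test system are illustrative choices, and indices are zero-based, so the loop runs from n − 1 down to 0.

```python
import numpy as np


def back_substitution(R, b):
    """Solve R x = b for upper triangular R with nonzero diagonal entries."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the already-computed terms R[i, i+1:] x[i+1:], then divide by R[i, i].
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x


# Small upper triangular system whose solution is (1, 2, 3).
R = np.array([[2.0, 1.0, -1.0],
              [0.0, 3.0, 2.0],
              [0.0, 0.0, 4.0]])
b = np.array([1.0, 12.0, 12.0])
x = back_substitution(R, b)
```

In the first pass of the loop the slice `R[i, i + 1:]` is empty, so the inner product is zero and the step reduces to a single division, as in the first step of the algorithm.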

Lower triangular matrices with nonzero diagonal elements are also invertible; 
we can solve equations with lower triangular invertible matrices using forward sub- 
stitution, the obvious analog of the algorithm given above. In forward substitution, 
we find x, first, then x2, and so on. 


Complexity of back substitution. The first step requires 1 flop (division by $R_{nn}$). 
The next step requires one multiply, one subtraction, and one division, for a total 
of 3 flops. The kth step requires k − 1 multiplies, k − 1 subtractions, and one 
division, for a total of 2k − 1 flops. The total number of flops for back substitution 
is then 

\[ 1 + 3 + 5 + \cdots + (2n - 1) = n^2 \]

flops. 
This formula can be obtained from the formula (5.7), or directly derived using 
a similar argument. Here is the argument for the case when n is even; a similar 
argument works when n is odd. Lump the first entry in the sum together with 
the last entry, the second entry together with the second-to-last entry, and so on. 
Each of these pairs adds up to 2n; since there are n/2 such pairs, the total is 
$(n/2)(2n) = n^2$. 


Solving linear equations using the QR factorization. The formula (11.3) for the 
inverse of a matrix in terms of its QR factorization suggests a method for solving 
a square system of linear equations Ax = b with A invertible. The solution 

\[ x = A^{-1}b = R^{-1}Q^T b \tag{11.4} \]

can be found by first computing the matrix-vector product $y = Q^T b$, and then 
solving the triangular equation Rx = y by back substitution. 


Algorithm 11.2 SOLVING LINEAR EQUATIONS VIA QR FACTORIZATION 


given an n x n invertible matrix A and an n-vector b. 


1. QR factorization. Compute the QR factorization A = QR. 
2. Compute $Q^T b$. 


3. Back substitution. Solve the triangular equation $Rx = Q^T b$ using back 
substitution. 
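The three steps of algorithm 11.2 can be sketched as follows, assuming NumPy. The names `solve_via_qr` and `back_substitution` are illustrative. Note one assumption about the library: `np.linalg.qr` does not promise positive diagonal entries in R, unlike the QR factorization described in the text, but for an invertible A the diagonal is still nonzero, which is all back substitution needs.

```python
import numpy as np


def back_substitution(R, b):
    """Solve R x = b for upper triangular R with nonzero diagonal entries."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x


def solve_via_qr(A, b):
    """Solve A x = b for square invertible A: factor, form Q^T b, back substitute."""
    Q, R = np.linalg.qr(A)            # step 1: QR factorization
    y = Q.T @ b                       # step 2: compute Q^T b
    return back_substitution(R, y)    # step 3: solve R x = Q^T b


A = np.array([[1.0, -2.0, 3.0],
              [0.0, 2.0, 2.0],
              [-3.0, -4.0, -4.0]])
b = np.array([1.0, 2.0, 3.0])
x = solve_via_qr(A, b)
residual = np.linalg.norm(A @ x - b)
```

The residual norm should be close to zero, up to floating-point rounding.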


The first step requires $2n^3$ flops (see §5.4), the second step requires $2n^2$ flops, 
and the third step requires $n^2$ flops. The total number of flops is then 

\[ 2n^3 + 3n^2 \approx 2n^3, \]

so the order is $n^3$, cubic in the number of variables, which is the same as the number 
of equations. 

In the complexity analysis above, we found that the first step, the QR factor- 
ization, dominates the other two; that is, the cost of the other two is negligible 
in comparison to the cost of the first step. This has some interesting practical 
implications, which we discuss below. 


Factor-solve methods. Algorithm 11.2 is similar to many methods for solving a 
set of linear equations and is sometimes referred to as a factor-solve scheme. A 
factor-solve scheme consists of two steps. In the first (factor) step the coefficient 
matrix is factored as a product of matrices with special properties. In the second 
(solve) step one or more linear equations that involve the factors in the factorization 
are solved. (In algorithm 11.2, the solve step consists of steps 2 and 3.) The 
complexity of the solve step is smaller than the complexity of the factor step, and 
in many cases, it is negligible by comparison. This is the case in algorithm 11.2, 
where the factor step has order $n^3$ and the solve step has order $n^2$. 


Factor-solve methods with multiple right-hand sides. Now suppose that we must 
solve several sets of linear equations, 

\[ Ax_1 = b_1, \quad \ldots, \quad Ax_k = b_k, \]

all with the same coefficient matrix A, but different right-hand sides. We can 
express this as the matrix equation AX = B, where X is the n × k matrix with 
columns $x_1, \ldots, x_k$, and B is the n × k matrix with columns $b_1, \ldots, b_k$ (see page 180). 
Assuming A is invertible, the solution of AX = B is $X = A^{-1}B$. 

A naive way to solve the k problems $Ax_i = b_i$ (or in matrix notation, compute 
$X = A^{-1}B$) is to apply algorithm 11.2 k times, which costs $2kn^3$ flops. A more 
efficient method exploits the fact that A is the same matrix in each problem, so 
we can re-use the matrix factorization in step 1 and only need to repeat steps 2 
and 3 to compute $x_l = R^{-1}Q^T b_l$ for l = 1, ..., k. (This is sometimes referred to 
as factorization caching, since we save or cache the factorization after carrying it 
out, for later use.) The cost of this method is $2n^3 + 3kn^2$ flops, or approximately 
$2n^3$ flops if $k \ll n$. The (surprising) conclusion is that we can solve multiple sets of 
linear equations, with the same coefficient matrix A, at essentially the same cost 
as solving one set of linear equations. 
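Factorization caching can be illustrated with a short sketch, assuming NumPy. The random test data, the seed, and the helper `back_substitution` are hypothetical choices made for illustration. The QR factorization is computed once, and only the cheap steps are repeated for each right-hand side.

```python
import numpy as np


def back_substitution(R, b):
    """Solve R x = b for upper triangular R with nonzero diagonal entries."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x


rng = np.random.default_rng(0)
n, k = 6, 4
# Illustrative data: a random matrix shifted by a multiple of the identity,
# which makes it invertible here.
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, k))    # columns are the right-hand sides b_1, ..., b_k

# Factor step: carried out once and cached (order n^3).
Q, R = np.linalg.qr(A)

# Solve step: repeated for each right-hand side (order n^2 each).
X = np.column_stack([back_substitution(R, Q.T @ B[:, j]) for j in range(k)])
```

Each column of `X` solves the system with the corresponding column of `B`, so `A @ X` should reproduce `B` up to rounding.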


Backslash notation. In several software packages for manipulating matrices, A\b 
is taken to mean the solution of Ax = b, i.e., $A^{-1}b$, when A is invertible. This 
backslash notation is extended to matrix right-hand sides: A\B, with B an n × k 
matrix, denotes $A^{-1}B$, the solution of the matrix equation AX = B. (The 
computation is implemented as described above, by factoring A just once, and carrying 
out k back substitutions.) This backslash notation is not standard mathematical 
notation, however, so we will not use it in this book. 


Computing the matrix inverse. We can now describe a method to compute the 
inverse $B = A^{-1}$ of an (invertible) n × n matrix A. We first compute the QR 
factorization of A, so $A^{-1} = R^{-1}Q^T$. We can write this as $RB = Q^T$, which, 
written out by columns is 

\[ Rb_i = \tilde{q}_i, \quad i = 1, \ldots, n, \]

where $b_i$ is the ith column of B and $\tilde{q}_i$ is the ith column of $Q^T$. We can solve these 
equations using back substitution, to get the columns of the inverse B. 


Algorithm 11.3 COMPUTING THE INVERSE VIA QR FACTORIZATION 


given an n x n invertible matrix A. 


1. QR factorization. Compute the QR factorization A = QR. 


2. For i = 1, ..., n, 
   Solve the triangular equation $Rb_i = \tilde{q}_i$ using back substitution. 


The complexity of this method is $2n^3$ flops (for the QR factorization) and $n^3$ for 
n back substitutions, each of which costs $n^2$ flops. So we can compute the matrix 
inverse in around $3n^3$ flops. 

This gives an alternative method for solving the square set of linear equations 
Ax = b: We first compute the inverse matrix $A^{-1}$, and then the matrix-vector 
product $x = (A^{-1})b$. This method has a higher flop count than directly solving 
the equations using algorithm 11.2 ($3n^3$ versus $2n^3$), so algorithm 11.2 is the usual 
method of choice. While the matrix inverse appears in many formulas (such as the 
solution of a set of linear equations), it is computed far less often. 
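Algorithm 11.3 can be sketched in the same style, assuming NumPy. The names `inverse_via_qr` and `back_substitution` are illustrative, and `np.linalg.qr` may return an R with negative diagonal entries, which does not affect the result. The test matrix is the 3 × 3 example whose inverse was given earlier in this chapter.

```python
import numpy as np


def back_substitution(R, b):
    """Solve R x = b for upper triangular R with nonzero diagonal entries."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x


def inverse_via_qr(A):
    """Compute the inverse of a square invertible A column by column."""
    n = A.shape[0]
    Q, R = np.linalg.qr(A)     # step 1: QR factorization
    Qt = Q.T
    # Step 2: the ith column of the inverse solves R b_i = (ith column of Q^T).
    return np.column_stack([back_substitution(R, Qt[:, i]) for i in range(n)])


A = np.array([[1.0, -2.0, 3.0],
              [0.0, 2.0, 2.0],
              [-3.0, -4.0, -4.0]])
A_inv = inverse_via_qr(A)
```

For this matrix, 30 times `A_inv` should be close to an integer matrix, matching the inverse stated in the examples.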


Sparse linear equations. Systems of linear equations with sparse coefficient ma- 
trix arise in many applications. By exploiting the sparsity of the coefficient matrix, 
these linear equations can be solved far more efficiently than by using the generic 
algorithm 11.2. One method is to use the same basic algorithm 11.2, replacing 
the QR factorization with a variant that handles sparse matrices (see page 190). 
The memory usage and complexity of these methods depend in a complicated 
way on the sparsity pattern of the coefficient matrix. In order, the memory usage 
is typically a modest multiple of nnz(A) + n, the number of scalars required to 
specify the problem data A and b, which is typically much smaller than $n^2 + n$, the 
number of scalars required to store A and b if they are not sparse. The flop count 
for solving sparse linear equations is also typically closer in order to nnz(A) than 
$n^3$, the order when the matrix A is not sparse. 


Examples 


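Polynomial interpolation. The 4-vector c gives the coefficients of a cubic polynomial, 

\[ p(x) = c_1 + c_2 x + c_3 x^2 + c_4 x^3 \]

(see pages 154 and 120). We seek the coefficients that satisfy 

\[ p(-1.1) = b_1, \qquad p(-0.4) = b_2, \qquad p(0.2) = b_3, \qquad p(0.8) = b_4. \]

We can express this as the system of 4 equations in 4 variables Ac = b, where 

\[ A = \begin{bmatrix} 1 & -1.1 & (-1.1)^2 & (-1.1)^3 \\ 1 & -0.4 & (-0.4)^2 & (-0.4)^3 \\ 1 & 0.2 & (0.2)^2 & (0.2)^3 \\ 1 & 0.8 & (0.8)^2 & (0.8)^3 \end{bmatrix}, \]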


which is a specific Vandermonde matrix (see (6.7)). The unique solution is 
$c = A^{-1}b$, where 

\[ A^{-1} = \begin{bmatrix} -0.0370 & 0.3492 & 0.7521 & -0.0643 \\ 0.1388 & -1.8651 & 1.6239 & 0.1023 \\ 0.3470 & 0.1984 & -1.4957 & 0.9503 \\ -0.5784 & 1.9841 & -2.1368 & 0.7310 \end{bmatrix} \]

(to 4 decimal places). This is illustrated in figure 11.1, which shows the two 
cubic polynomials that interpolate the two sets of points shown as filled circles and 
squares, respectively. 

Figure 11.1 Cubic interpolants through two sets of points, shown as circles 
and squares. 

The columns of $A^{-1}$ are interesting: They give the coefficients of a polynomial 
that evaluates to 0 at three of the points, and 1 at the other point. For example, the 
first column of $A^{-1}$, which is $A^{-1}e_1$, gives the coefficients of the polynomial that 
has value 1 at −1.1, and value 0 at −0.4, 0.2, and 0.8. The four polynomials with 
coefficients given by the columns of $A^{-1}$ are called the Lagrange polynomials 
associated with the points −1.1, −0.4, 0.2, 0.8. These are plotted in figure 11.2. (The 
Lagrange polynomials are named after the mathematician Joseph-Louis Lagrange, 
whose name will re-appear in several other contexts.) 

The rows of $A^{-1}$ are also interesting: The ith row shows how the values $b_1$, 
..., $b_4$, the polynomial values at the points −1.1, −0.4, 0.2, 0.8, map into the ith 
coefficient of the polynomial, $c_i$. For example, we see that the coefficient $c_1$ is not 
very sensitive to the value of $b_1$ (since $(A^{-1})_{11}$ is small). We can also see that for 
each increase of one in $b_4$, the coefficient $c_3$ increases by around 0.95. 
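The interpolation matrix and its inverse are easy to reproduce. The following is a sketch, assuming NumPy and the coefficient ordering $p(x) = c_1 + c_2 x + c_3 x^2 + c_4 x^3$ used above; `np.vander` with `increasing=True` builds the matrix with powers in that order. The right-hand side `b` is a hypothetical set of values chosen only to exercise the code.

```python
import numpy as np

# Interpolation points from the example.
t = np.array([-1.1, -0.4, 0.2, 0.8])

# Row i of A is (1, t_i, t_i^2, t_i^3), matching the coefficient order c_1, ..., c_4.
A = np.vander(t, 4, increasing=True)
A_inv = np.linalg.inv(A)

# Column j of A_inv holds the coefficients of the jth Lagrange polynomial,
# so evaluating all four polynomials at all four points gives the identity.
lagrange_values = A @ A_inv

# Interpolate a hypothetical set of values b at the four points.
b = np.array([1.0, 0.5, -0.3, 0.7])
c = A_inv @ b
p_at_points = A @ c
```

Here `p_at_points` should reproduce `b`, and the first column of `A_inv` gives the cubic that equals 1 at −1.1 and 0 at the other three points.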


Figure 11.2 Lagrange polynomials associated with the points −1.1, −0.4, 
0.2, 0.8. 


Balancing chemical reactions. (See page 154 for background.) We consider the 
problem of balancing the chemical reaction 

\[ a_1 \mathrm{Cr_2O_7^{2-}} + a_2 \mathrm{Fe^{2+}} + a_3 \mathrm{H^{+}} \longrightarrow b_1 \mathrm{Cr^{3+}} + b_2 \mathrm{Fe^{3+}} + b_3 \mathrm{H_2O}, \]

where the superscript gives the charge of each reactant and product. There are 4 
atoms (Cr, O, Fe, H) and charge to balance. The reactant and product matrices 
are (using the order just listed) 

\[ R = \begin{bmatrix} 2 & 0 & 0 \\ 7 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -2 & 2 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \\ 3 & 3 & 0 \end{bmatrix}. \]

Imposing the condition that $a_1 = 1$ we obtain a square set of 6 linear equations, 

\[ \begin{bmatrix} 2 & 0 & 0 & -1 & 0 & 0 \\ 7 & 0 & 0 & 0 & 0 & -1 \\ 0 & 1 & 0 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 & 0 & -2 \\ -2 & 2 & 1 & -3 & -3 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}. \]

Solving these equations we obtain 

\[ a_1 = 1, \quad a_2 = 6, \quad a_3 = 14, \quad b_1 = 2, \quad b_2 = 6, \quad b_3 = 7. \]

(Setting $a_1 = 1$ could have yielded fractional values for the other coefficients, but 
in this case, it did not.) The balanced reaction is 

\[ \mathrm{Cr_2O_7^{2-}} + 6\mathrm{Fe^{2+}} + 14\mathrm{H^{+}} \longrightarrow 2\mathrm{Cr^{3+}} + 6\mathrm{Fe^{3+}} + 7\mathrm{H_2O}. \]
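The 6 × 6 system for this reaction can be solved directly. This is a minimal sketch, assuming NumPy; it stacks the reactant and product matrices as in the text and uses `np.linalg.solve` in place of the QR-based algorithm 11.2, purely for brevity.

```python
import numpy as np

# Rows: Cr, O, Fe, H, charge. Columns: the three reactants and the three products.
R = np.array([[2, 0, 0],
              [7, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [-2, 2, 1]], dtype=float)
P = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 2],
              [3, 3, 0]], dtype=float)

# Balance equations R a - P b = 0, plus the normalization a_1 = 1 as the last row.
normalization = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
K = np.vstack([np.hstack([R, -P]), normalization])
rhs = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

coeffs = np.linalg.solve(K, rhs)
a, b = coeffs[:3], coeffs[3:]
```

The vectors `a` and `b` hold the reactant and product coefficients in the order used in the reaction.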


Heat diffusion. We consider a diffusion system as described on page 155. Some of 
the nodes have fixed potential, i.e., e; is given; for the other nodes, the associated 
external source s; is zero. This would model a thermal system in which some 
nodes are in contact with the outside world or a heat source, which maintains 
their temperatures (via external heat flows) at constant values; the other nodes are 
internal, and have no heat sources. This gives us a set of n additional equations: 


\[ e_i = e_i^{\mathrm{fix}}, \quad i \in P, \qquad s_i = 0, \quad i \notin P, \]


where P is the set of indices of nodes with fixed potential. We can write these n 
equations in matrix-vector form as 


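\[ Bs + Ce = d, \]

where $B$ and $C$ are the $n \times n$ diagonal matrices, and $d$ is the $n$-vector given by

\[
B_{ii} = \begin{cases} 0 & i \in P \\ 1 & i \notin P, \end{cases} \qquad
C_{ii} = \begin{cases} 1 & i \in P \\ 0 & i \notin P, \end{cases} \qquad
d_i = \begin{cases} e_i^{\mathrm{fix}} & i \in P \\ 0 & i \notin P. \end{cases}
\]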


We assemble the flow conservation, edge flow, and the boundary conditions into 
one set of m + 2n equations in m + 2n variables (f, s, e): 


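\[
\begin{bmatrix} A & I & 0 \\ R & 0 & A^T \\ 0 & B & C \end{bmatrix}
\begin{bmatrix} f \\ s \\ e \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ d \end{bmatrix}.
\]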


(The matrix A is the incidence matrix of the graph, and R is the resistance matrix; 
see page 155.) Assuming the coefficient matrix is invertible, we have 
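\[
\begin{bmatrix} f \\ s \\ e \end{bmatrix}
=
\begin{bmatrix} A & I & 0 \\ R & 0 & A^T \\ 0 & B & C \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 0 \\ d \end{bmatrix}.
\]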


This is illustrated with an example in figure 11.3. The graph is a 100 x 100 grid, 
with 10000 nodes, and edges connecting each node to its horizontal and vertical 
neighbors. The resistance on each edge is the same. The nodes at the top and 
bottom are held at zero temperature, and the three sets of nodes with rectilinear 
shapes are held at temperature one. All other nodes have zero source value.

Figure 11.3 Temperature distribution on a 100 x 100 grid of nodes. Nodes
in the top and bottom rows are held at zero temperature. The three sets of
nodes with rectilinear shapes are held at temperature one.
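
To make the block system for heat diffusion concrete, here is a small sketch in Python with NumPy. It is not the $100 \times 100$ grid of the figure. As a hypothetical stand-in we use a chain of four nodes (numbered from 0 in the code) with unit resistances, where the two end nodes are held at potentials 0 and 1. All names in the code are ours.

```python
import numpy as np

# Hypothetical example: a chain of n = 4 nodes joined by m = 3 edges
# (edge j runs from node j to node j+1), every edge with resistance 1.
n, m = 4, 3
A = np.zeros((n, m))          # incidence matrix
for j in range(m):
    A[j, j] = -1.0            # edge j leaves node j
    A[j + 1, j] = 1.0         # edge j enters node j+1
R = np.eye(m)                 # resistance matrix

# Nodes 0 and 3 have fixed potentials 0 and 1; nodes 1 and 2 are internal.
fixed = {0: 0.0, 3: 1.0}
B = np.diag([0.0 if i in fixed else 1.0 for i in range(n)])
C = np.diag([1.0 if i in fixed else 0.0 for i in range(n)])
d = np.array([fixed.get(i, 0.0) for i in range(n)])

# Stack flow conservation, edge flow, and boundary conditions.
K = np.block([
    [A,                np.eye(n),        np.zeros((n, n))],
    [R,                np.zeros((m, n)), A.T],
    [np.zeros((n, m)), B,                C],
])
rhs = np.concatenate([np.zeros(n), np.zeros(m), d])

sol = np.linalg.solve(K, rhs)
f, s, e = sol[:m], sol[m:m + n], sol[m + n:]
print(e)   # potentials at the four nodes
```

For this chain the computed potentials rise in equal steps from the cold end to the hot end, which is what we expect for equal resistances in series.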


Pseudo-inverse 


Linearly independent columns and Gram invertibility. We first show that an
$m \times n$ matrix $A$ has linearly independent columns if and only if its $n \times n$ Gram
matrix $A^TA$ is invertible.

First suppose that the columns of $A$ are linearly independent. Let $x$ be an
$n$-vector which satisfies $(A^TA)x = 0$. Multiplying on the left by $x^T$ we get

\[ 0 = x^T0 = x^T(A^TAx) = x^TA^TAx = \|Ax\|^2, \]

which implies that $Ax = 0$. Since the columns of $A$ are linearly independent, we
conclude that $x = 0$. Since the only solution of $(A^TA)x = 0$ is $x = 0$, we conclude
that $A^TA$ is invertible.

Now let’s show the converse. Suppose the columns of $A$ are linearly dependent,
which means there is a nonzero $n$-vector $x$ which satisfies $Ax = 0$. Multiply on the
left by $A^T$ to get $(A^TA)x = 0$. This shows that the Gram matrix $A^TA$ is singular.


Pseudo-inverse of square or tall matrix. We show here that if $A$ has linearly
independent columns (and therefore, is square or tall) then it has a left inverse.
(We already have observed the converse, that a matrix with a left inverse has
linearly independent columns.) Assuming $A$ has linearly independent columns, we
know that $A^TA$ is invertible. We now observe that the matrix $(A^TA)^{-1}A^T$ is a
left inverse of $A$:

\[ \left( (A^TA)^{-1}A^T \right) A = (A^TA)^{-1}(A^TA) = I. \]

This particular left inverse of $A$ will come up in the sequel, and has a name,
the pseudo-inverse of $A$. It is denoted $A^\dagger$ (or $A^+$):

\[ A^\dagger = (A^TA)^{-1}A^T. \tag{11.5} \]

The pseudo-inverse is also called the Moore-Penrose inverse, after the
mathematicians Eliakim Moore and Roger Penrose.

When $A$ is square, the pseudo-inverse $A^\dagger$ reduces to the ordinary inverse:

\[ A^\dagger = (A^TA)^{-1}A^T = A^{-1}A^{-T}A^T = A^{-1}I = A^{-1}. \]


Note that this equation does not make sense (and certainly is not correct) when A 
is not square. 
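
The left-inverse property is easy to check numerically. The sketch below uses Python with NumPy and a hypothetical $3 \times 2$ matrix of our own choosing. It forms $(A^TA)^{-1}A^T$ by solving a linear system with the Gram matrix rather than inverting it, then multiplies by $A$ on each side.

```python
import numpy as np

# Hypothetical tall matrix with linearly independent columns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

gram = A.T @ A                         # 2 x 2 Gram matrix
A_pinv = np.linalg.solve(gram, A.T)    # (A^T A)^{-1} A^T, without forming the inverse

left_product = A_pinv @ A              # the 2 x 2 identity, up to rounding
right_product = A @ A_pinv             # 3 x 3, and not the identity
print(np.round(left_product, 12))
print(np.round(right_product, 4))
```

The second product illustrates why the pseudo-inverse of a tall matrix is only a left inverse: $AA^\dagger$ is a $3 \times 3$ matrix that differs from $I$.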


Pseudo-inverse of a square or wide matrix. Transposing all the equations, we
can show that a (square or wide) matrix $A$ has a right inverse if and only if its
rows are linearly independent. Indeed, one right inverse is given by

\[ A^T(AA^T)^{-1}. \tag{11.6} \]

(The matrix $AA^T$ is invertible if and only if the rows of $A$ are linearly independent.)

The matrix in (11.6) is also referred to as the pseudo-inverse of $A$, and denoted
$A^\dagger$. The only possible confusion in defining the pseudo-inverse using the two
different formulas (11.5) and (11.6) occurs when the matrix $A$ is square. In this case,
however, they both reduce to the ordinary inverse:

\[ A^T(AA^T)^{-1} = A^TA^{-T}A^{-1} = A^{-1}. \]


Pseudo-inverse in other cases. The pseudo-inverse $A^\dagger$ is defined for any matrix,
including the case when $A$ is tall but its columns are linearly dependent, the case
when $A$ is wide but its rows are linearly dependent, and the case when $A$ is square
but not invertible. In these cases, however, it is not a left inverse, right inverse, or
inverse, respectively. We mention it here since the reader may encounter it. (We
will see what $A^\dagger$ means in these cases in exercise 15.11.)


Pseudo-inverse via QR factorization. The QR factorization gives a simple
formula for the pseudo-inverse. If $A$ is left-invertible, its columns are linearly
independent and the QR factorization $A = QR$ exists. We have

\[ A^TA = (QR)^T(QR) = R^TQ^TQR = R^TR, \]

so

\[ A^\dagger = (A^TA)^{-1}A^T = (R^TR)^{-1}(QR)^T = R^{-1}R^{-T}R^TQ^T = R^{-1}Q^T. \]


We can compute the pseudo-inverse using the QR factorization, followed by back
substitution on the columns of $Q^T$. (This is exactly the same as algorithm 11.3
when $A$ is square and invertible.) The complexity of this method is $2n^2m$ flops (for
the QR factorization), and $mn^2$ flops for the $m$ back substitutions. So the total is
$3mn^2$ flops.

Similarly, if $A$ is right-invertible, the QR factorization $A^T = QR$ of its transpose
exists. We have $AA^T = (QR)^T(QR) = R^TQ^TQR = R^TR$ and

\[ A^\dagger = A^T(AA^T)^{-1} = QR(R^TR)^{-1} = QRR^{-1}R^{-T} = QR^{-T}. \]

We can compute it using the method described above, using the formula

\[ (A^T)^\dagger = (A^\dagger)^T. \]


Solving over- and under-determined systems of linear equations. The
pseudo-inverse gives us a method for solving over-determined and under-determined
systems of linear equations, provided the columns of the coefficient matrix are linearly
independent (in the over-determined case), or the rows are linearly independent
(in the under-determined case). If the columns of $A$ are linearly independent, and
the over-determined equations $Ax = b$ have a solution, then $x = A^\dagger b$ is it. If the
rows of $A$ are linearly independent, the under-determined equations $Ax = b$ have
a solution for any vector $b$, and $x = A^\dagger b$ is a solution.
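
The under-determined case can be illustrated with a short sketch in Python with NumPy. The $2 \times 3$ matrix and the right-hand side are hypothetical data chosen for illustration. We apply the right inverse $A^T(AA^T)^{-1}$ to $b$ by solving a system with $AA^T$.

```python
import numpy as np

# Hypothetical wide 2 x 3 matrix with linearly independent rows.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([2.0, -1.0])

# x = A^T (A A^T)^{-1} b, computed with a linear solve instead of an inverse.
x = A.T @ np.linalg.solve(A @ A.T, b)
print(x)
print(A @ x)   # reproduces b
```

Any other right-hand side $b$ could be used here; since the rows of this $A$ are linearly independent, the product $Ax$ reproduces $b$ in every case.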


Numerical example. We illustrate these ideas with a simple numerical example, 
using the 3 x 2 matrix A used in earlier examples on pages 199 and 201, 


\[ A = \begin{bmatrix} -3 & -4 \\ 4 & 6 \\ 1 & 1 \end{bmatrix}. \]


This matrix has linearly independent columns, and QR factorization with (to 4 
digits) 


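\[
Q = \begin{bmatrix} -0.5883 & 0.4576 \\ 0.7845 & 0.5230 \\ 0.1961 & -0.7191 \end{bmatrix}, \qquad
R = \begin{bmatrix} 5.0990 & 7.2563 \\ 0 & 0.5883 \end{bmatrix}.
\]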


It has pseudo-inverse (to 4 digits) 


\[
A^\dagger = R^{-1}Q^T = \begin{bmatrix} -1.2222 & -1.1111 & 1.7778 \\ 0.7778 & 0.8889 & -1.2222 \end{bmatrix}.
\]


We can use the pseudo-inverse to check if the over-determined system of equations
$Ax = b$, with $b = (1, -2, 0)$, has a solution, and to find a solution if it does. We
compute $x = A^\dagger(1, -2, 0) = (1, -1)$ and check whether $Ax = b$ holds. It does, so
we have found the unique solution of $Ax = b$.
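
The numerical example can be reproduced with a few lines of Python and NumPy; this is a sketch under our own choice of tools and names. One caveat: `numpy.linalg.qr` may pick different signs for the columns of $Q$ and the rows of $R$ than the factorization shown above, but the product $R^{-1}Q^T$ is the same either way.

```python
import numpy as np

A = np.array([[-3.0, -4.0],
              [4.0, 6.0],
              [1.0, 1.0]])
b = np.array([1.0, -2.0, 0.0])

# Reduced QR factorization: Q is 3 x 2 with orthonormal columns,
# R is 2 x 2 upper triangular.
Q, R = np.linalg.qr(A)

# Solving R X = Q^T gives X = R^{-1} Q^T. A general solver is used for
# brevity; back substitution would exploit the triangular structure of R.
A_pinv = np.linalg.solve(R, Q.T)

x = A_pinv @ b
print(np.round(A_pinv, 4))
print(x, np.allclose(A @ x, b))
```

The final check matters: $A^\dagger b$ is defined for every $b$, but it solves $Ax = b$ only when the over-determined system actually has a solution.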
Exercises


Affine combinations of left inverses. Let $Z$ be a tall $m \times n$ matrix with linearly independent
columns, and let $X$ and $Y$ be left inverses of $Z$. Show that for any scalars $\alpha$ and $\beta$ satisfying
$\alpha + \beta = 1$, $\alpha X + \beta Y$ is also a left inverse of $Z$. It follows that if a matrix has two different
left inverses, it has an infinite number of different left inverses.


Left and right inverses of a vector. Suppose that x is a nonzero n-vector with n > 1. 


(a) Does x have a left inverse? 


(b) Does x have a right inverse? 


In each case, if the answer is yes, give a left or right inverse; if the answer is no, give a 
specific nonzero vector and show that it is not left- or right-invertible. 


Matrix cancellation. Suppose the scalars $a$, $x$, and $y$ satisfy $ax = ay$. When $a \neq 0$ we
can conclude that $x = y$; that is, we can cancel the $a$ on the left of the equation. In this
exercise we explore the matrix analog of cancellation, specifically, what properties of $A$
are needed to conclude $X = Y$ from $AX = AY$, for matrices $A$, $X$, and $Y$?

(a) Give an example showing that $A \neq 0$ is not enough to conclude that $X = Y$.

(b) Show that if $A$ is left-invertible, we can conclude from $AX = AY$ that $X = Y$.

(c) Show that if $A$ is not left-invertible, there are matrices $X$ and $Y$ with $X \neq Y$, and
$AX = AY$.

Remark. Parts (b) and (c) show that you can cancel a matrix on the left when, and only 
when, the matrix is left-invertible. 


Transpose of orthogonal matrix. Let $U$ be an orthogonal $n \times n$ matrix. Show that its
transpose $U^T$ is also orthogonal.


Inverse of a block matrix. Consider the (n + 1) x (n + 1) matrix 


where a is an n-vector. 


(a) When is A invertible? Give your answer in terms of a. Justify your answer. 


(b) Assuming the condition you found in part (a) holds, give an expression for the inverse
matrix $A^{-1}$.


Inverse of a block upper triangular matrix. Let $B$ and $D$ be invertible matrices of sizes
$m \times m$ and $n \times n$, respectively, and let $C$ be any $m \times n$ matrix. Find the inverse of

\[ A = \begin{bmatrix} B & C \\ 0 & D \end{bmatrix} \]

in terms of $B^{-1}$, $C$, and $D^{-1}$. (The matrix $A$ is called block upper triangular.)


Hints. First get an idea of what the solution should look like by considering the case
when $B$, $C$, and $D$ are scalars. For the matrix case, your goal is to find matrices $W$, $X$,
$Y$, $Z$ (in terms of $B^{-1}$, $C$, and $D^{-1}$) that satisfy

\[ A \begin{bmatrix} W & X \\ Y & Z \end{bmatrix} = I. \]

Use block matrix multiplication to express this as a set of four matrix equations that you
can then solve. The method you will find is sometimes called block back substitution.

Inverse of an upper triangular matrix. Suppose the $n \times n$ matrix $R$ is upper triangular
and invertible, i.e., its diagonal entries are all nonzero. Show that $R^{-1}$ is also upper
triangular. Hint. Use back substitution to solve $Rs_k = e_k$, for $k = 1, \ldots, n$, and argue
that $(s_k)_i = 0$ for $i > k$.


If a matrix is small, its inverse is large. If a number $a$ is small, its inverse $1/a$ (assuming
$a \neq 0$) is large. In this exercise you will explore a matrix analog of this idea. Suppose the
$n \times n$ matrix $A$ is invertible. Show that $\|A^{-1}\| \geq \sqrt{n}/\|A\|$. This implies that if a matrix
is small, its inverse is large. Hint. You can use the inequality $\|AB\| \leq \|A\|\|B\|$, which
holds for any matrices for which the product makes sense. (See exercise 10.12.)


Push-through identity. Suppose $A$ is $m \times n$, $B$ is $n \times m$, and the $m \times m$ matrix $I + AB$
is invertible.

(a) Show that the $n \times n$ matrix $I + BA$ is invertible. Hint. Show that $(I + BA)x = 0$
implies $(I + AB)y = 0$, where $y = Ax$.

(b) Establish the identity

\[ B(I + AB)^{-1} = (I + BA)^{-1}B. \]

This is sometimes called the push-through identity since the matrix $B$ appearing on
the left ‘moves’ into the inverse, and ‘pushes’ the $B$ in the inverse out to the right
side. Hint. Start with the identity

\[ B(I + AB) = (I + BA)B, \]

and multiply on the right by $(I + AB)^{-1}$, and on the left by $(I + BA)^{-1}$.

Reverse-time linear dynamical system. A linear dynamical system has the form

\[ x_{t+1} = Ax_t, \]

where $x_t$ is the ($n$-vector) state in period $t$, and $A$ is the $n \times n$ dynamics matrix. This
formula gives the state in the next period as a function of the current state.
We want to derive a recursion of the form

\[ x_{t-1} = A^{\mathrm{rev}}x_t, \]

which gives the previous state as a function of the current state. We call this the reverse
time linear dynamical system.

(a) When is this possible? When it is possible, what is $A^{\mathrm{rev}}$?


(b) For the specific linear dynamical system with dynamics matrix 


3 2 
aafaa 


find $A^{\mathrm{rev}}$, or explain why the reverse time linear dynamical system doesn’t exist.

Interpolation of rational functions. (Continuation of exercise 8.8.) Find a rational function

\[ f(t) = \frac{c_1 + c_2t + c_3t^2}{1 + d_1t + d_2t^2} \]

that satisfies the following interpolation conditions:

\[ f(1) = 2, \quad f(2) = 5, \quad f(3) = 9, \quad f(4) = -1, \quad f(5) = -4. \]


In exercise 8.8 these conditions were expressed as a set of linear equations in the coefficients
$c_1$, $c_2$, $c_3$, $d_1$ and $d_2$; here we are asking you to form and (numerically) solve the system
of equations. Plot the rational function you find over the range $x = 0$ to $x = 6$. Your
plot should include markers at the interpolation points $(1, 2), \ldots, (5, -4)$. (Your rational
function graph should pass through these points.)

Combinations of invertible matrices. Suppose the $n \times n$ matrices $A$ and $B$ are both
invertible. Determine whether each of the matrices given below is invertible, without any
further assumptions about $A$ and $B$.

(a) $A + B$.

(b) $\begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}$.

(c) $\begin{bmatrix} A & A + B \\ 0 & B \end{bmatrix}$.

(d) $ABA$.


Another left inverse. Suppose the $m \times n$ matrix $A$ is tall and has linearly independent
columns. One left inverse of $A$ is the pseudo-inverse $A^\dagger$. In this problem we explore
another one. Write $A$ as the block matrix

\[ A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, \]

where $A_1$ is $n \times n$. We assume that $A_1$ is invertible (which need not happen in general).
Show that the following matrix is a left inverse of $A$:

\[ \tilde{A} = \begin{bmatrix} A_1^{-1} & 0_{n \times (m-n)} \end{bmatrix}. \]


Middle inverse. Suppose A is an n x p matrix and B is a q x n matrix. If a p x q matrix 
X exists that satisfies AXB = I, we call it a middle inverse of the pair A, B. (This is 
not a standard concept.) Note that when A or B is an identity matrix, the middle inverse 
reduces to the right or left inverse, respectively. 


(a) Describe the conditions on A and B under which a middle inverse X exists. Give 
your answer using only the following four concepts: Linear independence of the rows 
or columns of A, and linear independence of the rows or columns of B. You must 
justify your answer. 


(b) Give an expression for a middle inverse, assuming the conditions in part (a) hold. 


Invertibility of population dynamics matrix. Consider the population dynamics matrix

\[
A = \begin{bmatrix}
b_1 & b_2 & \cdots & b_{99} & b_{100} \\
1 - d_1 & 0 & \cdots & 0 & 0 \\
0 & 1 - d_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 - d_{99} & 0
\end{bmatrix},
\]

where $b_i \geq 0$ are the birth rates and $0 \leq d_i \leq 1$ are death rates. What are the conditions
on $b_i$ and $d_i$ under which $A$ is invertible? (If the matrix is never invertible or always
invertible, say so.) Justify your answer.


Inverse of running sum matrix. Find the inverse of the $n \times n$ running sum matrix,

\[
S = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & 1 & \cdots & 1 & 0 \\
1 & 1 & \cdots & 1 & 1
\end{bmatrix}.
\]

Does your answer make sense?

A matrix identity. Suppose $A$ is a square matrix that satisfies $A^k = 0$ for some integer $k$.
(Such a matrix is called nilpotent.) A student guesses that $(I - A)^{-1} = I + A + \cdots + A^{k-1}$,
based on the infinite series $1/(1 - a) = 1 + a + a^2 + \cdots$, which holds for numbers $a$ that
satisfy $|a| < 1$.

Is the student right or wrong? If right, show that her assertion holds with no further
assumptions about $A$. If she is wrong, give a counterexample, i.e., a matrix $A$ that
satisfies $A^k = 0$, but $I + A + \cdots + A^{k-1}$ is not the inverse of $I - A$.


Tall-wide product. Suppose $A$ is an $n \times p$ matrix and $B$ is a $p \times n$ matrix, so $C = AB$
makes sense. Explain why $C$ cannot be invertible if $A$ is tall and $B$ is wide, i.e., if $p < n$.
Hint. First argue that the columns of $B$ must be linearly dependent.


Control restricted to one time period. A linear dynamical system has the form
$x_{t+1} = Ax_t + u_t$, where the $n$-vector $x_t$ is the state and $u_t$ is the input at time $t$. Our goal is to
choose the input sequence $u_1, \ldots, u_{N-1}$ so as to achieve $x_N = x^{\mathrm{des}}$, where $x^{\mathrm{des}}$ is a given
$n$-vector, and $N$ is given. The input sequence must satisfy $u_t = 0$ unless $t = K$, where
$K < N$ is given. In other words, the input can only act at time $t = K$. Give a formula
for $u_K$ that achieves this goal. Your formula can involve $A$, $N$, $K$, $x_1$, and $x^{\mathrm{des}}$. You
can assume that $A$ is invertible. Hint. First derive an expression for $x_K$, then use the
dynamics equation to find $x_{K+1}$. From $x_{K+1}$ you can find $x_N$.


Immigration. The population dynamics of a country is given by $x_{t+1} = Ax_t + u$,
$t = 1, \ldots, T - 1$, where the 100-vector $x_t$ gives the population age distribution in year $t$,
and $u$ gives the immigration age distribution (with negative entries meaning emigration),
which we assume is constant (i.e., does not vary with $t$). You are given $A$, $x_1$, and $x^{\mathrm{des}}$, a
100-vector that represents a desired population distribution in year $T$. We seek a constant
level of immigration $u$ that achieves $x_T = x^{\mathrm{des}}$.

Give a matrix formula for u. If your formula only makes sense when some conditions hold 
(for example invertibility of one or more matrices), say so. 


Quadrature weights. Consider a quadrature problem (see exercise 8.12) with $n = 4$, with
points $t = (-0.6, -0.2, 0.2, 0.6)$. We require that the quadrature rule be exact for all
polynomials of degree up to $d = 3$.

Set this up as a square system of linear equations in the weight vector. Numerically solve
this system to get the weights. Compute the true value and the quadrature estimate,

\[ \alpha = \int_{-1}^{1} f(x)\,dx, \qquad \hat\alpha = w_1f(-0.6) + w_2f(-0.2) + w_3f(0.2) + w_4f(0.6), \]

for the specific function $f(x) = e^x$.


Properties of pseudo-inverses. For an $m \times n$ matrix $A$ and its pseudo-inverse $A^\dagger$, show
that $A = AA^\dagger A$ and $A^\dagger = A^\dagger AA^\dagger$ in each of the following cases.


(a) A is tall with linearly independent columns. 
(b) A is wide with linearly independent rows. 


(c) A is square and invertible. 


Product of pseudo-inverses. Suppose $A$ and $D$ are right-invertible matrices and the
product $AD$ exists. We have seen that if $B$ is a right inverse of $A$ and $E$ is a right inverse of
$D$, then $EB$ is a right inverse of $AD$. Now suppose $B$ is the pseudo-inverse of $A$ and $E$ is
the pseudo-inverse of $D$. Is $EB$ the pseudo-inverse of $AD$? Prove that this is always true
or give an example for which it is false.


11.24 Simultaneous left inverse. The two matrices 


=. nne w 
are both left-invertible, and have multiple left inverses. Do they have a common left
inverse? Explain how to find a 2 x 4 matrix C that satisfies CA = CB = I, or determine 
that no such matrix exists. (You can use numerical computing to find C.) Hint. Set up 
a set of linear equations for the entries of C. Remark. There is nothing special about the 
particular entries of the two matrices A and B. 


Checking the computed solution of linear equations. One of your colleagues says that 
whenever you compute the solution $x$ of a square set of $n$ equations $Ax = b$ (say, using
QR factorization), you should compute the number $\|Ax - b\|$ and check that it is small. (It
is not exactly zero due to the small rounding errors made in floating point computations.)
Another colleague says that this would be nice to do, but the additional cost of computing
$\|Ax - b\|$ is too high. Briefly comment on your colleagues’ advice. Who is right?


Sensitivity of solution of linear equations. Let $A$ be an invertible $n \times n$ matrix, and $b$ and
$x$ be $n$-vectors satisfying $Ax = b$. Suppose we perturb the $j$th entry of $b$ by $\epsilon \neq 0$ (which
is a traditional symbol for a small quantity), so $b$ becomes $\tilde b = b + \epsilon e_j$. Let $\tilde x$ be the
$n$-vector that satisfies $A\tilde x = \tilde b$, i.e., the solution of the linear equations using the perturbed
right-hand side. We are interested in $\|x - \tilde x\|$, which is how much the solution changes
due to the change in the right-hand side. The ratio $\|x - \tilde x\|/|\epsilon|$ gives the sensitivity of the
solution to changes (perturbations) of the $j$th entry of $b$.

(a) Show that $\|x - \tilde x\|$ does not depend on $b$; it only depends on the matrix $A$, $\epsilon$, and $j$.

(b) How would you find the index $j$ that maximizes the value of $\|x - \tilde x\|$? By part (a),
your answer should be in terms of $A$ (or quantities derived from $A$) and $\epsilon$ only.

Remark. If a small change in the right-hand side vector $b$ can lead to a large change in the
solution, we say that the linear equations $Ax = b$ are poorly conditioned or ill-conditioned.
As a practical matter it means that unless you are very confident in what the entries of
$b$ are, the solution $A^{-1}b$ may not be useful in practice.


Timing test. Generate a random $n \times n$ matrix $A$ and an $n$-vector $b$, for $n = 500$, $n = 1000$,
and $n = 2000$. For each of these, compute the solution $x = A^{-1}b$ (for example using the
backslash operator, if the software you are using supports it), and verify that $Ax - b$ is
(very) small. Report the time it takes to solve each of these three sets of linear equations,
and for each one work out the implied speed of your processor in Gflop/s, based on the
$2n^3$ complexity of solving equations using the QR factorization.


Solving multiple linear equations efficiently. Suppose the $n \times n$ matrix $A$ is invertible. We
can solve the system of linear equations $Ax = b$ in around $2n^3$ flops using algorithm 11.2.
Once we have done that (specifically, computed the QR factorization of $A$), we can solve an
additional set of linear equations with the same matrix but a different right-hand side, $Ay = c$,
in around $3n^2$ additional flops. Assuming we have solved both of these sets of equations,
suppose we want to solve $Az = d$, where $d = \alpha b + \beta c$ is a linear combination of $b$ and $c$.
(We are given the coefficients $\alpha$ and $\beta$.) Suggest a method for doing this that is even
faster than re-using the QR factorization of $A$. Your method should have a complexity
that is linear in $n$. Give rough estimates for the time needed to solve $Ax = b$, $Ay = c$,
and $Az = d$ (using your method) for $n = 3000$ on a computer capable of carrying out 1
Gflop/s.


Chapter 12


Least squares 


In this chapter we look at the powerful idea of finding approximate solutions of 
over-determined systems of linear equations by minimizing the sum of the squares 
of the errors in the equations. The method, and some extensions we describe in 
later chapters, are widely used in many application areas. It was discovered
independently by the mathematicians Carl Friedrich Gauss and Adrien-Marie Legendre
around the beginning of the 19th century. 


Least squares problem 


Suppose that the m x n matrix A is tall, so the system of linear equations Ax = b, 
where b is an m-vector, is over-determined, i.e., there are more equations (m) 
than variables to choose (n). These equations have a solution only if b is a linear 
combination of the columns of A. 

For most choices of $b$, however, there is no $n$-vector $x$ for which $Ax = b$. As a
compromise, we seek an $x$ for which $r = Ax - b$, which we call the residual (for the
equations $Ax = b$), is as small as possible. This suggests that we should choose $x$
so as to minimize the norm of the residual, $\|Ax - b\|$. If we find an $x$ for which the
residual vector is small, we have $Ax \approx b$, i.e., $x$ almost satisfies the linear equations
$Ax = b$. (Some authors define the residual as $b - Ax$, which will not affect us since
$\|Ax - b\| = \|b - Ax\|$.)

Minimizing the norm of the residual and its square are the same, so we can just 
as well minimize 

\[ \|Ax - b\|^2 = \|r\|^2 = r_1^2 + \cdots + r_m^2, \]

the sum of squares of the residuals. The problem of finding an $n$-vector $\hat x$ that
minimizes $\|Ax - b\|^2$, over all possible choices of $x$, is called the least squares
problem. It is denoted using the notation

\[ \mbox{minimize} \quad \|Ax - b\|^2, \tag{12.1} \]


where we should specify that the variable is $x$ (meaning that we should choose $x$).
The matrix $A$ and the vector $b$ are called the data for the problem (12.1), which
means that they are given to us when we are asked to choose $x$. The quantity to
be minimized, $\|Ax - b\|^2$, is called the objective function (or just objective) of the
least squares problem (12.1).

The problem (12.1) is sometimes called linear least squares to emphasize that
the residual $r$ (whose norm squared we are to minimize) is an affine function of $x$,
and to distinguish it from the nonlinear least squares problem, in which we allow
the residual $r$ to be an arbitrary function of $x$. We will study the nonlinear least
squares problem in chapter 18.

Any vector $\hat x$ that satisfies $\|A\hat x - b\|^2 \leq \|Ax - b\|^2$ for all $x$ is a solution of the least
squares problem (12.1). Such a vector is called a least squares approximate solution
of $Ax = b$. It is very important to understand that a least squares approximate
solution $\hat x$ of $Ax = b$ need not satisfy the equations $A\hat x = b$; it simply makes the
norm of the residual as small as it can be. Some authors use the confusing phrase
‘$\hat x$ solves $Ax = b$ in the least squares sense’, but we emphasize that a least squares
approximate solution $\hat x$ does not, in general, solve the equation $Ax = b$.

If $\|A\hat x - b\|$ (which we call the optimal residual norm) is small, then we can say
that $\hat x$ approximately solves $Ax = b$. On the other hand, if there is an $n$-vector $x$
that satisfies $Ax = b$, then it is a solution of the least squares problem, since its
associated residual norm is zero.

Another name for the least squares problem (12.1), typically used in data fitting
applications (the topic of the next chapter), is regression. We say that $\hat x$, a solution
of the least squares problem, is the result of regressing the vector $b$ onto the columns
of $A$.


Column interpretation. If the columns of $A$ are the $m$-vectors $a_1, \ldots, a_n$, then
the least squares problem (12.1) is the problem of finding a linear combination of
the columns that is closest to the $m$-vector $b$; the vector $x$ gives the coefficients:

\[ \|Ax - b\|^2 = \|(x_1a_1 + \cdots + x_na_n) - b\|^2. \]

If $\hat x$ is a solution of the least squares problem, then the vector

\[ A\hat x = \hat x_1a_1 + \cdots + \hat x_na_n \]

is closest to the vector $b$, among all linear combinations of the vectors $a_1, \ldots, a_n$.


Row interpretation. Suppose the rows of $A$ are the $n$-row-vectors $\tilde a_1^T, \ldots, \tilde a_m^T$, so
the residual components are given by

\[ r_i = \tilde a_i^Tx - b_i, \quad i = 1, \ldots, m. \]

The least squares objective is then

\[ \|Ax - b\|^2 = (\tilde a_1^Tx - b_1)^2 + \cdots + (\tilde a_m^Tx - b_m)^2, \]

the sum of the squares of the residuals in $m$ scalar linear equations. Minimizing
this sum of squares of the residuals is a reasonable compromise if our goal is to
choose $x$ so that all of them are small.

Example. We consider the least squares problem with data

\[ A = \begin{bmatrix} 2 & 0 \\ -1 & 1 \\ 0 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}. \]

The over-determined set of three equations in two variables $Ax = b$,

\[ 2x_1 = 1, \qquad -x_1 + x_2 = 0, \qquad 2x_2 = -1, \]

has no solution. (From the first equation we have $x_1 = 1/2$, and from the last
equation we have $x_2 = -1/2$; but then the second equation does not hold.) The
corresponding least squares problem is

\[ \mbox{minimize} \quad (2x_1 - 1)^2 + (-x_1 + x_2)^2 + (2x_2 + 1)^2. \]


This least squares problem can be solved using the methods described in the next
section (or simple calculus). Its unique solution is $\hat x = (1/3, -1/3)$. The least
squares approximate solution $\hat x$ does not satisfy the equations $Ax = b$; the
corresponding residuals are

\[ \hat r = A\hat x - b = (-1/3, -2/3, 1/3), \]

with sum of squares value $\|A\hat x - b\|^2 = 2/3$. Let us compare this to another choice
of $x$, $\tilde x = (1/2, -1/2)$, which corresponds to (exactly) solving the first and last of
the three equations in $Ax = b$. It gives the residual

\[ \tilde r = A\tilde x - b = (0, -1, 0), \]

with sum of squares value $\|A\tilde x - b\|^2 = 1$.

The column interpretation tells us that

\[
(1/3) \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} + (-1/3) \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}
= \begin{bmatrix} 2/3 \\ -2/3 \\ -2/3 \end{bmatrix}
\]

is the linear combination of the columns of $A$ that is closest to $b$.

Figure 12.1 shows the values of the least squares objective $\|Ax - b\|^2$ versus
$x = (x_1, x_2)$, with the least squares solution $\hat x$ shown as the dark point, with
objective value $\|A\hat x - b\|^2 = 2/3$. The curves show the points $x$ that have objective
value $\|A\hat x - b\|^2 + 1$, $\|A\hat x - b\|^2 + 2$, and so on.
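
The numbers in this example can be checked with a short sketch in Python with NumPy. The choice of tool and the helper name `objective` are ours; `numpy.linalg.lstsq` is one of several routines that return a least squares approximate solution.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [-1.0, 1.0],
              [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

def objective(x):
    """Least squares objective ||Ax - b||^2."""
    r = A @ x - b
    return float(r @ r)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # least squares approximate solution
x_tilde = np.array([0.5, -0.5])                 # solves the first and last equations exactly

print(x_hat, objective(x_hat))       # objective is about 2/3
print(x_tilde, objective(x_tilde))   # objective is 1
```

Neither point satisfies all three equations, but $\hat x$ gives the smaller sum of squared residuals.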


Solution 


In this section we derive several expressions for the solution of the least squares 
problem (12.1), under one assumption on the data matrix A: 


The columns of A are linearly independent. (12.2)

Figure 12.1 Level curves of the function $\|Ax - b\|^2 = (2x_1 - 1)^2 + (-x_1 + x_2)^2 + (2x_2 + 1)^2$.
The point $\hat x$ minimizes the function.


Solution via calculus. In this section we find the solution of the least squares 
problem using some basic results from calculus, reviewed in §C.2. (We will also 
give an independent verification of the result, that does not rely on calculus, below.) 
We know that any minimizer $\hat x$ of the function $f(x) = \|Ax - b\|^2$ must satisfy

\[ \frac{\partial f}{\partial x_i}(\hat x) = 0, \quad i = 1, \ldots, n, \]

which we can express as the vector equation

\[ \nabla f(\hat x) = 0, \]

where $\nabla f(\hat x)$ is the gradient of $f$ evaluated at $\hat x$. The gradient can be expressed in
matrix form as

\[ \nabla f(x) = 2A^T(Ax - b). \tag{12.3} \]


This formula can be derived from the chain rule given on page 184, and the gradient 
of the sum of squares function, given in §C.1. For completeness, we will derive the 
formula (12.3) from scratch here. Writing the least squares objective out as a sum, 
we get 


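\[ f(x) = \|Ax - b\|^2 = \sum_{i=1}^m \left( \sum_{j=1}^n A_{ij}x_j - b_i \right)^2. \]

To find $\nabla f(x)_k$ we take the partial derivative of $f$ with respect to $x_k$. Differentiating
the sum term by term, we get

\[
\begin{aligned}
\nabla f(x)_k = \frac{\partial f}{\partial x_k}(x)
  &= \sum_{i=1}^m 2 \left( \sum_{j=1}^n A_{ij}x_j - b_i \right) A_{ik} \\
  &= \sum_{i=1}^m 2 (A^T)_{ki}(Ax - b)_i \\
  &= \left( 2A^T(Ax - b) \right)_k.
\end{aligned}
\]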
This is our formula (12.3), written out in terms of its components. 
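
A numerical sanity check of the gradient formula (12.3) is sketched below in Python with NumPy. The data are random and purely illustrative. We compare the formula against central finite differences of $f$, taken one coordinate at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # arbitrary test data, not from the text
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    """Least squares objective ||Ax - b||^2."""
    r = A @ x - b
    return r @ r

grad = 2 * A.T @ (A @ x - b)      # formula (12.3)

# Central differences: (f(x + h e_k) - f(x - h e_k)) / (2h) for each k.
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])

print(np.max(np.abs(grad - grad_fd)))   # small, limited by rounding error
```

Because $f$ is quadratic, the central difference has no truncation error, so any discrepancy comes from floating point rounding alone.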
Now we continue the derivation of the solution of the least squares problem. 
Any minimizer $\hat x$ of $\|Ax - b\|^2$ must satisfy

\[ \nabla f(\hat x) = 2A^T(A\hat x - b) = 0, \]

which can be written as

\[ A^TA\hat x = A^Tb. \tag{12.4} \]

These equations are called the normal equations. The coefficient matrix $A^TA$ is
the Gram matrix associated with $A$; its entries are inner products of columns of $A$.

Our assumption (12.2) that the columns of $A$ are linearly independent implies
that the Gram matrix $A^TA$ is invertible (§11.5, page 214). This implies that

\[ \hat x = (A^TA)^{-1}A^Tb \tag{12.5} \]

is the only solution of the normal equations (12.4). So this must be the unique
solution of the least squares problem (12.1).

We have already encountered the matrix $(A^TA)^{-1}A^T$ that appears in (12.5): It
is the pseudo-inverse of the matrix $A$, given in (11.5). So we can write the solution
of the least squares problem in the simple form

\[ \hat x = A^\dagger b. \tag{12.6} \]

We observed in §11.5 that $A^\dagger$ is a left inverse of $A$, which means that $\hat x = A^\dagger b$
solves $Ax = b$ if this set of over-determined equations has a solution. But now
we see that $\hat x = A^\dagger b$ is the least squares approximate solution, i.e., it minimizes
$\|Ax - b\|^2$. (And if there is a solution of $Ax = b$, then $\hat x = A^\dagger b$ is it.)
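
As an illustration, the normal equations (12.4) can be solved directly for the small example used earlier in this chapter. This is a sketch in Python with NumPy, under our own naming. Forming $A^TA$ explicitly is fine for a problem this small; it is not meant as a recommendation for large or poorly conditioned problems.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [-1.0, 1.0],
              [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

# Normal equations (A^T A) x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

grad_at_x_hat = 2 * A.T @ (A @ x_hat - b)   # gradient (12.3) at the minimizer
x_via_pinv = np.linalg.pinv(A) @ b          # formula (12.6)

print(x_hat, grad_at_x_hat)
```

The gradient vanishes at $\hat x$, and the pseudo-inverse formula gives the same vector.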

The equation (12.6) looks very much like the formula for solution of the linear
equations $Ax = b$, when $A$ is square and invertible, i.e., $x = A^{-1}b$. It is very
important to understand the difference between the formula (12.6) for the least
squares approximate solution, and the formula for the solution of a square set
of linear equations, $x = A^{-1}b$. In the case of linear equations and the inverse,
$x = A^{-1}b$ actually satisfies $Ax = b$. In the case of the least squares approximate
solution, $\hat x = A^\dagger b$ generally does not satisfy $A\hat x = b$.

The formula (12.6) shows us that the solution $\hat x$ of the least squares problem
is a linear function of $b$. This generalizes the fact that the solution of a square
invertible set of linear equations is a linear function of its right-hand side.

Direct verification of least squares solution. In this section we directly show
that $\hat x = (A^TA)^{-1}A^Tb$ is the solution of the least squares problem (12.1), without
relying on calculus. We will show that for any $x \neq \hat x$, we have

\[ \|A\hat x - b\|^2 < \|Ax - b\|^2, \]

establishing that $\hat x$ is the unique vector that minimizes $\|Ax - b\|^2$.
We start by writing

\[
\begin{aligned}
\|Ax - b\|^2 &= \|(Ax - A\hat x) + (A\hat x - b)\|^2 \\
  &= \|Ax - A\hat x\|^2 + \|A\hat x - b\|^2 + 2(Ax - A\hat x)^T(A\hat x - b),
\end{aligned} \tag{12.7}
\]

where we use the identity

\[ \|u + v\|^2 = (u + v)^T(u + v) = \|u\|^2 + \|v\|^2 + 2u^Tv. \]

The third term in (12.7) is zero:

\[
\begin{aligned}
(Ax - A\hat x)^T(A\hat x - b) &= (x - \hat x)^TA^T(A\hat x - b) \\
  &= (x - \hat x)^T(A^TA\hat x - A^Tb) \\
  &= (x - \hat x)^T0 \\
  &= 0,
\end{aligned}
\]

where we use $(A^TA)\hat x = A^Tb$ (the normal equations) in the third line. With this
simplification, (12.7) reduces to

\[ \|Ax - b\|^2 = \|A(x - \hat x)\|^2 + \|A\hat x - b\|^2. \]

The first term on the right-hand side is nonnegative and therefore

\[ \|Ax - b\|^2 \geq \|A\hat x - b\|^2. \]


This shows that $\hat x$ minimizes $\|Ax - b\|^2$; we now show that it is the unique minimizer.
Suppose equality holds above, that is, $\|Ax - b\|^2 = \|A\hat x - b\|^2$. Then we have
$\|A(x - \hat x)\|^2 = 0$, which implies $A(x - \hat x) = 0$. Since $A$ has linearly independent
columns, we conclude that $x - \hat x = 0$, i.e., $x = \hat x$. So the only $x$ with $\|Ax - b\|^2 =
\|A\hat x - b\|^2$ is $x = \hat x$; for all $x \neq \hat x$, we have $\|Ax - b\|^2 > \|A\hat x - b\|^2$.
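The decomposition used in this argument can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the sizes, the random seed, and the helper name `check_decomposition` are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary fixed seed (assumption)
m, n = 30, 5                        # hypothetical sizes for a tall matrix A
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Least squares approximate solution from the normal equations (A^T A) x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

def check_decomposition(x):
    """Return (lhs, rhs, cross) for the identity
    ||Ax - b||^2 = ||A(x - x_hat)||^2 + ||A x_hat - b||^2."""
    lhs = np.sum((A @ x - b) ** 2)
    rhs = np.sum((A @ (x - x_hat)) ** 2) + np.sum((A @ x_hat - b) ** 2)
    cross = (A @ x - A @ x_hat) @ (A @ x_hat - b)   # third term of (12.7)
    return lhs, rhs, cross

x = rng.standard_normal(n)          # an arbitrary competitor x != x_hat
lhs, rhs, cross = check_decomposition(x)
print(lhs, rhs, cross)              # lhs and rhs agree; cross is zero up to rounding
```

Any other choice of `x` should give the same agreement, with `lhs` strictly larger than the optimal value whenever `x` differs from `x_hat`.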


Row form. The formula for the least squares approximate solution can be expressed
in a useful form in terms of the rows $\tilde a_i^T$ of the matrix $A$:

$$
\hat x = (A^TA)^{-1}A^Tb = \left(\sum_{i=1}^m \tilde a_i \tilde a_i^T\right)^{-1} \left(\sum_{i=1}^m b_i \tilde a_i\right).
\tag{12.8}
$$

In this formula we express the $n \times n$ Gram matrix $A^TA$ as a sum of $m$ outer
products, and the $n$-vector $A^Tb$ as a sum of $m$ $n$-vectors.
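The row form can be illustrated by accumulating the two sums one row at a time. This is a sketch under the assumption that NumPy is available; the variable names `G`, `h`, and `x_rows` and the problem sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary fixed seed (assumption)
m, n = 20, 4                        # hypothetical sizes
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

G = np.zeros((n, n))                # accumulates the Gram matrix A^T A
h = np.zeros(n)                     # accumulates the n-vector A^T b
for a_row, b_i in zip(A, b):        # a_row is the i-th row of A, as an n-vector
    G += np.outer(a_row, a_row)     # outer product term in the first sum
    h += b_i * a_row                # term in the second sum

x_rows = np.linalg.solve(G, h)                  # row-form solution
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]    # reference solution
print(np.allclose(x_rows, x_ref))
```

Only `G` and `h` need to be stored while the rows are processed, which is the property that exercise-style recursive schemes build on.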

Figure 12.2 Illustration of orthogonality principle for a least squares problem
of size $m = 3$, $n = 2$. The optimal residual $\hat r$ is orthogonal to any linear
combination of $a_1$ and $a_2$, the two columns of $A$.


Orthogonality principle. The point $A\hat x$ is the linear combination of the columns
of $A$ that is closest to $b$. The optimal residual is $\hat r = A\hat x - b$. The optimal
residual satisfies a property that is sometimes called the orthogonality principle:
It is orthogonal to the columns of $A$, and therefore, it is orthogonal to any linear
combination of the columns of $A$. In other words, for any $n$-vector $z$, we have

$$(Az) \perp \hat r. \tag{12.9}$$

We can derive the orthogonality principle from the normal equations, which can
be expressed as $A^T(A\hat x - b) = 0$. For any $n$-vector $z$, we have

$$(Az)^T\hat r = (Az)^T(A\hat x - b) = z^TA^T(A\hat x - b) = 0.$$

The orthogonality principle is illustrated in figure 12.2, for a least squares problem
with $m = 3$ and $n = 2$. The shaded plane is the set of all linear combinations
$z_1a_1 + z_2a_2$ of $a_1$ and $a_2$, the two columns of $A$. The point $A\hat x$ is the closest point
in the plane to $b$. The optimal residual $\hat r$ is shown as the vector from $b$ to $A\hat x$. This
vector is orthogonal to any point in the shaded plane.
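The orthogonality principle is easy to observe numerically. The sketch below assumes NumPy and uses a random 3 x 2 matrix (the same size as in figure 12.2, but not the data of the figure).

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary fixed seed (assumption)
A = rng.standard_normal((3, 2))         # m = 3, n = 2; random data, not the figure's
b = rng.standard_normal(3)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r_hat = A @ x_hat - b                   # optimal residual

z = rng.standard_normal(2)              # arbitrary coefficients z
inner = (A @ z) @ r_hat                 # inner product of Az and the residual
print(inner, A.T @ r_hat)               # both vanish up to rounding error
```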


Solving least squares problems 


We can use the QR factorization to compute the least squares approximate
solution (12.5). Let $A = QR$ be the QR factorization of $A$ (which exists by our
assumption (12.2) that its columns are linearly independent). We have already
seen that the pseudo-inverse $A^\dagger$ can be expressed as $A^\dagger = R^{-1}Q^T$, so we have

$$\hat x = R^{-1}Q^Tb. \tag{12.10}$$

To compute $\hat x$ we first multiply $b$ by $Q^T$; then we compute $R^{-1}(Q^Tb)$ using back
substitution. This is summarized in the following algorithm, which computes the
least squares approximate solution $\hat x$, given $A$ and $b$.

Algorithm 12.1 LEAST SQUARES VIA QR FACTORIZATION

given an $m \times n$ matrix $A$ with linearly independent columns and an $m$-vector $b$.
1. QR factorization. Compute the QR factorization $A = QR$.
2. Compute $Q^Tb$.
3. Back substitution. Solve the triangular equation $R\hat x = Q^Tb$.
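The three steps of algorithm 12.1 can be sketched as follows, assuming NumPy. The function names `back_substitution` and `least_squares_qr` are hypothetical. Note that `np.linalg.qr` may return an $R$ with negative diagonal entries, unlike the convention used in this book; the product $QR$ and the computed $\hat x$ are unaffected.

```python
import numpy as np

def back_substitution(R, c):
    """Solve R x = c for an upper triangular R with nonzero diagonal entries."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):      # last unknown first
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def least_squares_qr(A, b):
    """Least squares approximate solution of Ax = b via QR (A tall, independent columns)."""
    Q, R = np.linalg.qr(A)              # step 1: reduced QR, Q is m x n, R is n x n
    c = Q.T @ b                         # step 2: compute Q^T b
    return back_substitution(R, c)      # step 3: solve R x = Q^T b

# Usage on random data (sizes are illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x_hat = least_squares_qr(A, b)
print(np.linalg.norm(A.T @ (A @ x_hat - b)))   # normal equations residual, near zero
```

The explicit back substitution loop is written out here only to mirror step 3; a library triangular solver would normally be used instead.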


Comparison to solving a square system of linear equations. Recall that the
solution of the square invertible system of linear equations $Ax = b$ is $x = A^{-1}b$.
We can express $x$ using the QR factorization of $A$ as $x = R^{-1}Q^Tb$ (see (11.4)). This
equation is formally identical to (12.10). The only difference is that in (12.10), $A$
and $Q$ need not be square, and $R^{-1}Q^Tb$ is the least squares approximate solution,
which is not (in general) a solution of $Ax = b$.

Indeed, algorithm 12.1 is formally the same as algorithm 11.2, the QR factorization
method for solving linear equations. (The only difference is that in
algorithm 12.1, $A$ and $Q$ can be tall.)

When $A$ is square, solving the linear equations $Ax = b$ and the least squares
problem of minimizing $\|Ax - b\|^2$ are the same, and algorithm 11.2 and
algorithm 12.1 are the same. So we can think of algorithm 12.1 as a generalization of
algorithm 11.2, which solves the equation $Ax = b$ when $A$ is square, and computes
the least squares approximate solution when $A$ is tall.


Backslash notation. Several software packages for manipulating matrices extend
the backslash operator (see page 209) to mean the least squares approximate
solution of an over-determined set of linear equations. In these packages A\b is taken
to mean the solution $A^{-1}b$ of $Ax = b$ when $A$ is square and invertible, and the
least squares approximate solution $A^\dagger b$ when $A$ is tall and has linearly
independent columns. (We remind the reader that this backslash notation is not standard
mathematical notation.)


Complexity. The complexity of the first step of algorithm 12.1 is $2mn^2$ flops. The
second step involves a matrix-vector multiplication, which takes $2mn$ flops. The
third step requires $n^2$ flops. The total number of flops is

$$2mn^2 + 2mn + n^2 \approx 2mn^2,$$

neglecting the second and third terms, which are smaller than the first by factors
of $n$ and $2m$, respectively. The order of the algorithm is $mn^2$. The complexity is
linear in the row dimension of $A$ and quadratic in the number of variables.


Sparse least squares. Least squares problems with sparse $A$ arise in several
applications and can be solved more efficiently, for example by using a QR factorization
tailored for sparse matrices (see page 190) in the generic algorithm 12.1.

Another simple approach for exploiting sparsity of $A$ is to solve the normal
equations $A^TA\hat x = A^Tb$ by solving a larger (but sparse) system of equations,

$$
\begin{bmatrix} 0 & A^T \\ A & I \end{bmatrix}
\begin{bmatrix} \hat x \\ \hat y \end{bmatrix}
=
\begin{bmatrix} 0 \\ b \end{bmatrix}.
\tag{12.11}
$$

This is a square set of $m + n$ linear equations. Its coefficient matrix is sparse when $A$
is sparse. If $(\hat x, \hat y)$ satisfies these equations, it is easy to see that $\hat x$ satisfies the
normal equations: the second block row gives $\hat y = b - A\hat x$, and substituting this into
the first block row gives $A^T(b - A\hat x) = 0$. Conversely, if $\hat x$ satisfies the normal
equations, $(\hat x, \hat y)$ satisfies (12.11) with $\hat y = b - A\hat x$. Any method for solving a
sparse system of linear equations can be used to solve (12.11).
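The structure of the larger system (12.11) can be sketched with dense NumPy arrays. This is only an illustration of the block layout, with made-up sizes and random data; in a real sparse problem the blocks would be stored in a sparse format and passed to a sparse solver, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(3)          # arbitrary fixed seed (assumption)
m, n = 12, 4                            # hypothetical sizes
A = rng.standard_normal((m, n))         # dense stand-in for a sparse matrix
b = rng.standard_normal(m)

# Coefficient matrix and right-hand side of the (m+n) x (m+n) system (12.11).
K = np.block([[np.zeros((n, n)), A.T],
              [A,                np.eye(m)]])
rhs = np.concatenate([np.zeros(n), b])

sol = np.linalg.solve(K, rhs)
x_hat, y_hat = sol[:n], sol[n:]         # y_hat should equal b - A x_hat

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_hat, x_ref), np.allclose(y_hat, b - A @ x_hat))
```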


Matrix least squares. A simple extension of the least squares problem is to choose
the $n \times k$ matrix $X$ so as to minimize $\|AX - B\|^2$. Here $A$ is an $m \times n$ matrix and
$B$ is an $m \times k$ matrix, and the norm is the matrix norm. This is sometimes called
the matrix least squares problem. When $k = 1$, $X$ and $B$ are vectors, and the matrix
least squares problem reduces to the usual least squares problem.

The matrix least squares problem is in fact nothing but a set of $k$ ordinary least
squares problems. To see this, we note that

$$\|AX - B\|^2 = \|Ax_1 - b_1\|^2 + \cdots + \|Ax_k - b_k\|^2,$$

where $x_j$ is the $j$th column of $X$ and $b_j$ is the $j$th column of $B$. (Here we use the
property that the square of the matrix norm is the sum of the squared norms of
the columns of the matrix.) So the objective is a sum of $k$ terms, with each term
depending on only one column of $X$. It follows that we can choose the columns $x_j$
independently, each one by minimizing its associated term $\|Ax_j - b_j\|^2$. Assuming
that $A$ has linearly independent columns, the solution is $\hat x_j = A^\dagger b_j$. The solution
of the matrix least squares problem is therefore

$$
\begin{aligned}
\hat X &= \begin{bmatrix} \hat x_1 & \cdots & \hat x_k \end{bmatrix} \\
&= \begin{bmatrix} A^\dagger b_1 & \cdots & A^\dagger b_k \end{bmatrix} \\
&= A^\dagger \begin{bmatrix} b_1 & \cdots & b_k \end{bmatrix} \\
&= A^\dagger B.
\end{aligned}
\tag{12.12}
$$


The very simple solution $\hat X = A^\dagger B$ of the matrix least squares problem agrees with
the solution of the ordinary least squares problem when $k = 1$ (as it must). Many
software packages for linear algebra use the backslash operator A\B to denote $A^\dagger B$,
but this is not standard mathematical notation.

The matrix least squares problem can be solved efficiently by exploiting the fact
that algorithm 12.1 is another example of a factor-solve algorithm. To compute
$\hat X = A^\dagger B$ we carry out the QR factorization of $A$ once; we carry out steps 2 and 3 of
algorithm 12.1 for each of the $k$ columns of $B$. The total cost is $2mn^2 + k(2mn + n^2)$
flops. When $k$ is small compared to $n$ this is roughly $2mn^2$ flops, the same cost as
solving a single least squares problem (i.e., one with a vector right-hand side).
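The factor-solve idea can be sketched as below, assuming NumPy; `matrix_least_squares` is a hypothetical name. The QR factorization is computed once, and the multiplication by $Q^T$ and the back substitution are then applied to all $k$ columns of $B$ together.

```python
import numpy as np

def matrix_least_squares(A, B):
    """Minimize ||AX - B||^2 over X, reusing one QR factorization of A."""
    Q, R = np.linalg.qr(A)          # factor once: about 2mn^2 flops
    C = Q.T @ B                     # step 2 for every column of B
    n, k = R.shape[0], B.shape[1]
    X = np.zeros((n, k))
    for i in range(n - 1, -1, -1):  # step 3: back substitution on all columns at once
        X[i] = (C[i] - R[i, i + 1:] @ X[i + 1:]) / R[i, i]
    return X

# Usage on random data (sizes are illustrative only).
rng = np.random.default_rng(6)
A = rng.standard_normal((15, 4))
B = rng.standard_normal((15, 3))
X_hat = matrix_least_squares(A, B)
print(np.allclose(X_hat, np.linalg.lstsq(A, B, rcond=None)[0]))
```

Each column of `X_hat` coincides with the ordinary least squares solution for the corresponding column of `B`.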

12.4 Examples


Advertising purchases. We have $m$ demographic groups or audiences that we
want to advertise to, with a target number of impressions or views for each group,
which we give as a vector $v^{\mathrm{des}}$. (The entries are positive.) To reach these audiences,
we purchase advertising in $n$ different channels (say, different web publishers, radio,
print, ...), in amounts that we give as an $n$-vector $s$. (The entries of $s$ are
nonnegative, which we ignore.) The $m \times n$ matrix $R$ gives the number of impressions
in each group per dollar spending in the channels: $R_{ij}$ is the number of
impressions in group $i$ per dollar spent on advertising in channel $j$. (These entries are
estimated, and are nonnegative.) The $j$th column of $R$ gives the effectiveness or
reach (in impressions per dollar) for channel $j$. The $i$th row of $R$ shows which
media demographic group $i$ is exposed to. The total number of impressions in each
demographic group is the $m$-vector $v$, which is given by $v = Rs$. The goal is to
find $s$ so that $v = Rs \approx v^{\mathrm{des}}$. We can do this using least squares, by choosing
$s$ to minimize $\|Rs - v^{\mathrm{des}}\|^2$. (We are not guaranteed that the resulting channel
spend vector will be nonnegative.) This least squares formulation does not take
into account the total cost of the advertising; we will see in chapter 16 how this
can be done.

We consider a simple numerical example, with $n = 3$ channels and $m = 10$
demographic groups, and matrix

$$
R = \begin{bmatrix}
0.97 & 1.86 & 0.41 \\
1.23 & 2.18 & 0.53 \\
0.80 & 1.24 & 0.62 \\
1.29 & 0.98 & 0.51 \\
1.10 & 1.23 & 0.69 \\
0.67 & 0.34 & 0.54 \\
0.87 & 0.26 & 0.62 \\
1.10 & 0.16 & 0.48 \\
1.92 & 0.22 & 0.71 \\
1.29 & 0.12 & 0.62
\end{bmatrix},
$$

with units of 1000 views per dollar. The entries of $R$ range over an 18:1 range, so
the 3 channels are quite different in terms of their audience reach; see figure 12.3.
We take $v^{\mathrm{des}} = (10^3)\mathbf{1}$, i.e., our goal is to reach one million customers in each of
the 10 demographic groups. Least squares gives the advertising budget allocation

$$\hat s = (62, 100, 1443),$$

which achieves a views vector with RMS error 132, or 13.2% of the target values.
The views vector is shown in figure 12.4.
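The numerical example can be reproduced with a few lines of code. The sketch below assumes NumPy and uses a general-purpose least squares routine (`np.linalg.lstsq`) rather than any particular method from this chapter; the variable names are illustrative. The text above reports $\hat s = (62, 100, 1443)$ and an RMS error of 132, so the printed values should be close to those.

```python
import numpy as np

# Impressions (in units of 1000 views) per dollar: rows are groups, columns are channels.
R = np.array([
    [0.97, 1.86, 0.41],
    [1.23, 2.18, 0.53],
    [0.80, 1.24, 0.62],
    [1.29, 0.98, 0.51],
    [1.10, 1.23, 0.69],
    [0.67, 0.34, 0.54],
    [0.87, 0.26, 0.62],
    [1.10, 0.16, 0.48],
    [1.92, 0.22, 0.71],
    [1.29, 0.12, 0.62],
])
v_des = 1e3 * np.ones(10)               # one million views per group, in units of 1000

s_hat, *_ = np.linalg.lstsq(R, v_des, rcond=None)
rms_error = np.sqrt(np.mean((R @ s_hat - v_des) ** 2))

print(np.round(s_hat))                  # spending per channel, in dollars
print(rms_error, rms_error / 1e3)       # RMS error and its fraction of the target
```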


Illumination. A set of $n$ lamps illuminates an area that we divide into $m$ regions
or pixels. We let $l_i$ denote the lighting level in region $i$, so the $m$-vector $l$ gives the
illumination levels across all regions. We let $p_i$ denote the power at which lamp
$i$ operates, so the $n$-vector $p$ gives the set of lamp powers. (The lamp powers are
nonnegative and also must not exceed a maximum allowed power, but we ignore
these issues here.)

Figure 12.3 Number of impressions in ten demographic groups, per dollar
spent on advertising in three channels. The units are 1000 views per dollar.

Figure 12.4 Views vector that best approximates the target

The vector of illumination levels is a linear function of the lamp powers, so we
have $l = Ap$ for some $m \times n$ matrix $A$. The $j$th column of $A$ gives the illumination
pattern for lamp $j$, i.e., the illumination when lamp $j$ has power 1 and all other
lamps are off. We will assume that $A$ has linearly independent columns (and
therefore is tall). The $i$th row of $A$ gives the sensitivity of pixel $i$ to the $n$ lamp
powers.
The goal is to find lamp powers that result in a desired illumination pattern
$l^{\mathrm{des}}$, such as $l^{\mathrm{des}} = \alpha\mathbf{1}$, which is uniform illumination with value $\alpha$ across the area.
In other words, we seek $p$ so that $Ap \approx l^{\mathrm{des}}$. We can use least squares to find $\hat p$ that
minimizes the sum square deviation from the desired illumination, $\|Ap - l^{\mathrm{des}}\|^2$.
This gives the lamp power levels

$$\hat p = A^\dagger l^{\mathrm{des}} = \alpha A^\dagger \mathbf{1}.$$

(We are not guaranteed that these powers are nonnegative, or less than the
maximum allowed power level.)

An example is shown in figure 12.5. The area is a $25 \times 25$ grid with $m = 625$
pixels, each (say) 1m square. The lamps are at various heights ranging from 3m
to 6m, and at the positions shown in the figure. The illumination decays with an
inverse square law, so $A_{ij}$ is proportional to $d_{ij}^{-2}$, where $d_{ij}$ is the (3-D) distance
between the center of the pixel and the lamp position. The matrix $A$ is scaled so
that when all lamps have power one, the average illumination level is one. The
desired illumination pattern is $\mathbf{1}$, i.e., uniform with value 1.

With $p = \mathbf{1}$, the resulting illumination pattern is shown in the top part of
figure 12.5. The RMS illumination error is 0.24. We can see that the corners are
quite a bit darker than the center, and there are pronounced bright spots directly
beneath each lamp. Using least squares we find the lamp powers

$$\hat p = (1.46, 0.79, 2.97, 0.74, 0.08, 0.21, 0.21, 2.05, 0.91, 1.47).$$

The resulting illumination pattern has an RMS error of 0.14, about half of the
RMS error with all lamp powers set to one. The illumination pattern is shown in
the bottom plot of figure 12.5; we can see that the illumination is more uniform
than when all lamps have power 1. Most illumination values are near the target
level 1, with the corners a bit darker and the illumination a bit brighter directly
below each lamp, but less so than when all lamps have power one. This is clear
from figure 12.6, which shows the histogram of patch illumination values for all
lamp powers one, and for lamp powers $\hat p$.

Figure 12.5 A square area divided in a $25 \times 25$ grid. The circles show the
positions of 10 lamps, the number in parentheses next to each circle is the
height of the lamp. The top plot shows the illumination pattern with lamps
set to power one. The bottom plot shows the illumination pattern for the
lamp powers that minimize the sum square deviation with a desired uniform
illumination of one.

Figure 12.6 Histograms of pixel illumination values using $p = \mathbf{1}$ (top) and $\hat p$
(bottom). The target intensity value is one.

Exercises


12.1 Approximating a vector as a multiple of another one. In the special case $n = 1$, the general
least squares problem (12.1) reduces to finding a scalar $x$ that minimizes $\|ax - b\|^2$, where
$a$ and $b$ are $m$-vectors. (We write the matrix $A$ here in lower case, since it is an $m$-vector.)
Assuming $a$ and $b$ are nonzero, show that $\|a\hat x - b\|^2 = \|b\|^2(\sin\theta)^2$, where $\theta = \angle(a, b)$.
This shows that the optimal relative error in approximating one vector by a multiple of
another one depends on their angle.


12.2 Least squares with orthonormal columns. Suppose the $m \times n$ matrix $Q$ has orthonormal
columns and $b$ is an $m$-vector. Show that $\hat x = Q^Tb$ is the vector that minimizes $\|Qx - b\|^2$.
What is the complexity of computing $\hat x$, given $Q$ and $b$, and how does it compare to the
complexity of a general least squares problem with an $m \times n$ coefficient matrix?


12.3 Least angle property of least squares. Suppose the $m \times n$ matrix $A$ has linearly independent
columns, and $b$ is an $m$-vector. Let $\hat x = A^\dagger b$ denote the least squares approximate solution
of $Ax = b$.

(a) Show that for any $n$-vector $x$, $(Ax)^Tb = (Ax)^T(A\hat x)$, i.e., the inner product of $Ax$
and $b$ is the same as the inner product of $Ax$ and $A\hat x$. Hint. Use $(Ax)^Tb = x^T(A^Tb)$
and $(A^TA)\hat x = A^Tb$.

(b) Show that when $A\hat x$ and $b$ are both nonzero, we have

$$\frac{(A\hat x)^Tb}{\|A\hat x\|\,\|b\|} = \frac{\|A\hat x\|}{\|b\|}.$$

The left-hand side is the cosine of the angle between $A\hat x$ and $b$. Hint. Apply part (a)
with $x = \hat x$.

(c) Least angle property of least squares. The choice $x = \hat x$ minimizes the distance
between $Ax$ and $b$. Show that $x = \hat x$ also minimizes the angle between $Ax$ and $b$.
(You can assume that $Ax$ and $b$ are nonzero.) Remark. For any positive scalar $\alpha$,
$x = \alpha\hat x$ also minimizes the angle between $Ax$ and $b$.


12.4 Weighted least squares. In least squares, the objective (to be minimized) is

$$\|Ax - b\|^2 = \sum_{i=1}^m (\tilde a_i^Tx - b_i)^2,$$

where $\tilde a_i^T$ are the rows of $A$, and the $n$-vector $x$ is to be chosen. In the weighted least squares
problem, we minimize the objective

$$\sum_{i=1}^m w_i(\tilde a_i^Tx - b_i)^2,$$

where $w_i$ are given positive weights. The weights allow us to assign different weights to the
different components of the residual vector. (The objective of the weighted least squares
problem is the square of the weighted norm, $\|Ax - b\|_w^2$, as defined in exercise 3.28.)

(a) Show that the weighted least squares objective can be expressed as $\|D(Ax - b)\|^2$
for an appropriate diagonal matrix $D$. This allows us to solve the weighted least
squares problem as a standard least squares problem, by minimizing $\|Bx - d\|^2$,
where $B = DA$ and $d = Db$.

(b) Show that when $A$ has linearly independent columns, so does the matrix $B$.

(c) The least squares approximate solution is given by $\hat x = (A^TA)^{-1}A^Tb$. Give a similar
formula for the solution of the weighted least squares problem. You might want to
use the matrix $W = \mathbf{diag}(w)$ in your formula.

12.5 Approximate right inverse. Suppose the tall $m \times n$ matrix $A$ has linearly independent
columns. It does not have a right inverse, i.e., there is no $n \times m$ matrix $X$ for which
$AX = I$. So instead we seek the $n \times m$ matrix $X$ for which the residual matrix $R = AX - I$
has the smallest possible matrix norm. We call this matrix the least squares approximate
right inverse of $A$. Show that the least squares right inverse of $A$ is given by $\hat X = A^\dagger$.
Hint. This is a matrix least squares problem; see page 233.


12.6 Least squares equalizer design. (See exercise 7.15.) You are given a channel impulse
response, the $n$-vector $c$. Your job is to find an equalizer impulse response, the $n$-vector
$h$, that minimizes $\|h * c - e_1\|^2$. You can assume that $c_1 \neq 0$. Remark. $h$ is called an
equalizer since it approximately inverts, or undoes, convolution by $c$.

Explain how to find $h$. Apply your method to find the equalizer $h$ for the channel $c =
(1.0, 0.7, -0.3, -0.1, 0.05)$. Plot $c$, $h$, and $h * c$.


12.7 Network tomography. A network consists of $n$ links, labeled $1, \ldots, n$. A path through the
network is a subset of the links. (The order of the links on a path does not matter here.)
Each link has a (positive) delay, which is the time it takes to traverse it. We let $d$ denote
the $n$-vector that gives the link delays. The total travel time of a path is the sum of the
delays of the links on the path. Our goal is to estimate the link delays (i.e., the vector $d$),
from a large number of (noisy) measurements of the travel times along different paths.
This data is given to you as an $N \times n$ matrix $P$, where

$$
P_{ij} = \begin{cases} 1 & \text{link $j$ is on path $i$} \\ 0 & \text{otherwise,} \end{cases}
$$

and an $N$-vector $t$ whose entries are the (noisy) travel times along the $N$ paths. You can
assume that $N > n$. You will choose your estimate $\hat d$ by minimizing the RMS deviation
between the measured travel times ($t$) and the travel times predicted by the sum of the
link delays. Explain how to do this, and give a matrix expression for $\hat d$. If your expression
requires assumptions about the data $P$ or $t$, state them explicitly.


Remark. This problem arises in several contexts. The network could be a computer 
network, and a path gives the sequence of communication links data packets traverse. 
The network could be a transportation system, with the links representing road segments. 


12.8 Least squares and QR factorization. Suppose $A$ is an $m \times n$ matrix with linearly
independent columns and QR factorization $A = QR$, and $b$ is an $m$-vector. The vector $A\hat x$ is
the linear combination of the columns of $A$ that is closest to the vector $b$, i.e., it is the
projection of $b$ onto the set of linear combinations of the columns of $A$.

(a) Show that $A\hat x = QQ^Tb$. (The matrix $QQ^T$ is called the projection matrix.)

(b) Show that $\|A\hat x - b\|^2 = \|b\|^2 - \|Q^Tb\|^2$. (This is the square of the distance between $b$
and the closest linear combination of the columns of $A$.)


12.9 Invertibility of matrix in sparse least squares formulation. Show that the $(m+n) \times (m+n)$
coefficient matrix appearing in equation (12.11) is invertible if and only if the columns of
$A$ are linearly independent.


12.10 Numerical check of the least squares approximate solution. Generate a random $30 \times 10$
matrix $A$ and a random 30-vector $b$. Compute the least squares approximate solution
$\hat x = A^\dagger b$ and the associated residual norm squared $\|A\hat x - b\|^2$. (There may be several
ways to do this, depending on the software package you use.) Generate three different
random 10-vectors $d_1, d_2, d_3$, and verify that $\|A(\hat x + d_i) - b\|^2 > \|A\hat x - b\|^2$ holds. (This
shows that $x = \hat x$ has a smaller associated residual than the choices $x = \hat x + d_i$, $i = 1, 2, 3$.)


12.11 Complexity of matrix least squares problem. Explain how to compute the matrix least
squares approximate solution of $AX = B$, given by $\hat X = A^\dagger B$ (see (12.12)), in no more
than $2mn^2 + 3mnk$ flops. (In contrast, solving $k$ vector least squares problems to obtain
the columns of $\hat X$, in a naive way, requires $2mn^2k$ flops.)

12.12 Least squares placement. The 2-vectors $p_1, \ldots, p_N$ represent the locations or positions of
$N$ objects, for example, factories, warehouses, and stores. The last $K$ of these locations
are fixed and given; the goal in a placement problem is to choose the locations of the first
$N - K$ objects. Our choice of the locations is guided by an undirected graph; an edge
between two objects means we would like them to be close to each other. In least squares
placement, we choose the locations $p_1, \ldots, p_{N-K}$ so as to minimize the sum of the squares
of the distances between objects connected by an edge,

$$\|p_{i_1} - p_{j_1}\|^2 + \cdots + \|p_{i_L} - p_{j_L}\|^2,$$

where the $L$ edges of the graph are given by $(i_1, j_1), \ldots, (i_L, j_L)$.

(a) Let $\mathcal{D}$ be the Dirichlet energy of the graph, as defined on page 135. Show that the
sum of the squared distances between the $N$ objects can be expressed as $\mathcal{D}(u) + \mathcal{D}(v)$,
where $u = ((p_1)_1, \ldots, (p_N)_1)$ and $v = ((p_1)_2, \ldots, (p_N)_2)$ are $N$-vectors containing
the first and second coordinates of the objects, respectively.

(b) Express the least squares placement problem as a least squares problem, with
variable $x = (u_{1:(N-K)}, v_{1:(N-K)})$. In other words, express the objective above (the sum
of squares of the distances across edges) as $\|Ax - b\|^2$, for an appropriate $m \times n$ matrix
$A$ and $m$-vector $b$. You will find that $m = 2L$. Hint. Recall that $\mathcal{D}(y) = \|B^Ty\|^2$,
where $B$ is the incidence matrix of the graph.

(c) Solve the least squares placement problem for the specific problem with $N = 10$,
$K = 4$, $L = 13$, fixed locations

$$p_7 = (0, 0), \quad p_8 = (0, 1), \quad p_9 = (1, 1), \quad p_{10} = (1, 0),$$

and edges

(1,3), (1,4), (1,7), (2,3), (2,5), (2,8), (2,9),
(3,4), (3,5), (4,6), (5,6), (6,9), (6,10).

Plot the locations, showing the graph edges as lines connecting the locations.


12.13 Iterative method for least squares problem. Suppose that $A$ has linearly independent
columns, so $\hat x = A^\dagger b$ minimizes $\|Ax - b\|^2$. In this exercise we explore an iterative
method, due to the mathematician Lewis Richardson, that can be used to compute $\hat x$. We
define $x^{(1)} = 0$ and for $k = 1, 2, \ldots$,

$$x^{(k+1)} = x^{(k)} - \mu A^T(Ax^{(k)} - b),$$

where $\mu$ is a positive parameter, and the superscripts denote the iteration number. This
defines a sequence of vectors that converge to $\hat x$ provided $\mu$ is not too large; the choice
$\mu = 1/\|A\|^2$, for example, always works. The iteration is terminated when $A^T(Ax^{(k)} - b)$
is small enough, which means the least squares optimality conditions are almost satisfied.

To implement the method we only need to multiply vectors by $A$ and by $A^T$. If we have
efficient methods for carrying out these two matrix-vector multiplications, this iterative
method can be faster than algorithm 12.1 (although it does not give the exact solution).
Iterative methods are often used for very large scale least squares problems.

(a) Show that if $x^{(k+1)} = x^{(k)}$, we have $x^{(k)} = \hat x$.

(b) Express the vector sequence $x^{(k)}$ as a linear dynamical system with constant
dynamics matrix and offset, i.e., in the form $x^{(k+1)} = Fx^{(k)} + g$.

(c) Generate a random $20 \times 10$ matrix $A$ and 20-vector $b$, and compute $\hat x = A^\dagger b$. Run
the Richardson algorithm with $\mu = 1/\|A\|^2$ for 500 iterations, and plot $\|x^{(k)} - \hat x\|$
to verify that $x^{(k)}$ appears to be converging to $\hat x$.

12.14 Recursive least squares. In some applications of least squares the rows of the coefficient
matrix $A$ become available (or are added) sequentially, and we wish to solve the resulting
family of growing least squares problems. Define the $k \times n$ matrices and $k$-vectors

$$
A^{(k)} = \begin{bmatrix} a_1^T \\ \vdots \\ a_k^T \end{bmatrix} = A_{1:k,1:n},
\qquad
b^{(k)} = \begin{bmatrix} b_1 \\ \vdots \\ b_k \end{bmatrix} = b_{1:k},
$$

for $k = 1, \ldots, m$. We wish to compute $\hat x^{(k)} = A^{(k)\dagger}b^{(k)}$, for $k = n, n+1, \ldots, m$. We will
assume that the columns of $A^{(n)}$ are linearly independent, which implies that the columns
of $A^{(k)}$ are linearly independent for $k = n, \ldots, m$. We will also assume that $m$ is much
larger than $n$. The naive method for computing $\hat x^{(k)}$ requires $2kn^2$ flops, so the total cost
for $k = n, \ldots, m$ is

$$
\sum_{k=n}^m 2kn^2 = \left(\sum_{k=n}^m k\right)(2n^2) = \left(\frac{m^2 - n^2 + m + n}{2}\right)(2n^2) \approx m^2n^2 \text{ flops}.
$$

A simple trick allows us to compute $\hat x^{(k)}$ for $k = n, \ldots, m$ much more efficiently, with a
cost that grows linearly with $m$. The trick also requires memory storage order $n^2$, which
does not depend on $m$. For $k = 1, \ldots, m$, define

$$G^{(k)} = (A^{(k)})^TA^{(k)}, \qquad h^{(k)} = (A^{(k)})^Tb^{(k)}.$$

(a) Show that $\hat x^{(k)} = (G^{(k)})^{-1}h^{(k)}$ for $k = n, \ldots, m$. Hint. See (12.8).

(b) Show that $G^{(k+1)} = G^{(k)} + a_{k+1}a_{k+1}^T$ and $h^{(k+1)} = h^{(k)} + b_{k+1}a_{k+1}$, for $k = 1, \ldots, m-1$.

(c) Recursive least squares is the following algorithm. For $k = n, \ldots, m$, compute $G^{(k+1)}$
and $h^{(k+1)}$ using (b); then compute $\hat x^{(k)}$ using (a). Work out the total flop count for
this method, keeping only dominant terms. (You can include the cost of computing
$G^{(n)}$ and $h^{(n)}$, which should be negligible in the total.) Compare to the flop count
for the naive method.

Remark. A further trick called the matrix inversion lemma (which is beyond the scope of
this book) can be used to reduce the complexity of recursive least squares to order $mn^2$.

12.15 Minimizing a squared norm plus an affine function. A generalization of the least squares
problem (12.1) adds an affine function to the least squares objective,

$$\text{minimize} \quad \|Ax - b\|^2 + c^Tx + d,$$

where the $n$-vector $x$ is the variable to be chosen, and the (given) data are the $m \times n$ matrix
$A$, the $m$-vector $b$, the $n$-vector $c$, and the number $d$. We will use the same assumption
we use in least squares: The columns of $A$ are linearly independent. This generalized
problem can be solved by reducing it to a standard least squares problem, using a trick
called completing the square.

Show that the objective of the problem above can be expressed in the form

$$\|Ax - b\|^2 + c^Tx + d = \|Ax - b + f\|^2 + g,$$

for some $m$-vector $f$ and some constant $g$. It follows that we can solve the generalized
least squares problem by minimizing $\|Ax - (b - f)\|$, an ordinary least squares problem
with solution $\hat x = A^\dagger(b - f)$.

Hints. Express the norm squared term on the right-hand side as $\|(Ax - b) + f\|^2$ and
expand it. Then argue that the equality above holds provided $2A^Tf = c$. One possible
choice is $f = (1/2)(A^\dagger)^Tc$. (You must justify these statements.)

12.16 Gram method for computing least squares approximate solution. Algorithm 12.1 uses the
QR factorization to compute the least squares approximate solution $\hat x = A^\dagger b$, where the
$m \times n$ matrix $A$ has linearly independent columns. It has a complexity of $2mn^2$ flops. In
this exercise we consider an alternative method: First, form the Gram matrix $G = A^TA$
and the vector $h = A^Tb$; and then compute $\hat x = G^{-1}h$ (using algorithm 11.2). What is
the complexity of this method? Compare it to algorithm 12.1. Remark. You might find
that the Gram algorithm appears to be a bit faster than the QR method, but the factor is
not large enough to have any practical significance. The idea is useful in situations where
$G$ is partially available and can be computed more efficiently than by multiplying $A$ and
its transpose. An example is exercise 13.21.


Chapter 13


Least squares data fitting


In this chapter we introduce one of the most important applications of least squares
methods, to the problem of data fitting. The goal is to find a mathematical model,
or an approximate model, of some relation, given some observed data.


13.1 Least squares data fitting


Least squares is widely used as a method to construct a mathematical model from
some data, experiments, or observations. Suppose we have an $n$-vector $x$, and a
scalar $y$, and we believe that they are related, perhaps approximately, by some
function $f : \mathbf{R}^n \to \mathbf{R}$:

$$y \approx f(x).$$

The vector $x$ might represent a set of $n$ feature values, and is called the feature
vector or the vector of independent variables, depending on the context. The scalar
$y$ represents some outcome (also called response variable) that we are interested
in. Or $x$ might represent the previous $n$ values of a time series, and $y$ represents
the next value.


Data. We don’t know $f$, although we might have some idea about its general
form. But we do have some data, given by

$$x^{(1)}, \ldots, x^{(N)}, \qquad y^{(1)}, \ldots, y^{(N)},$$

where the $n$-vector $x^{(i)}$ is the feature vector and the scalar $y^{(i)}$ is the associated
value of the outcome for data sample $i$. We sometimes refer to the pair $x^{(i)}, y^{(i)}$
as the $i$th data pair. These data are also called observations, examples, samples,
or measurements, depending on the context. Here we use the superscript $(i)$ to
denote the $i$th data point: $x^{(i)}$ is an $n$-vector, the $i$th independent variable; the
number $x_j^{(i)}$ is the value of $j$th feature for example $i$.

Model. We will form a model of the relationship between $x$ and $y$, given by

$$y \approx \hat f(x),$$

where $\hat f : \mathbf{R}^n \to \mathbf{R}$. We write $\hat y = \hat f(x)$, where $\hat y$ is the (scalar) prediction (of the
outcome $y$), given the independent variable (vector) $x$. The hat appearing over $f$
is traditional notation that suggests that the function $\hat f$ is an approximation of the
function $f$. The function $\hat f$ is called the model, prediction function, or predictor.
For a specific value of the feature vector $x$, $\hat y = \hat f(x)$ is the prediction of the
outcome.


Linear in the parameters model. We will focus on a specific form for the model,
which has the form

$$\hat f(x) = \theta_1f_1(x) + \cdots + \theta_pf_p(x),$$

where $f_i : \mathbf{R}^n \to \mathbf{R}$ are basis functions or feature mappings that we choose, and
$\theta_i$ are the model parameters that we choose. This form of model is called linear in
the parameters, since for each $x$, $\hat f(x)$ is a linear function of the model parameter
$p$-vector $\theta$. The basis functions are usually chosen based on our idea of what $f$
looks like. (We will see many examples of this below.) Once the basis functions
have been chosen, there is the question of how to choose the model parameters,
given our set of data.


Prediction error. Our goal is to choose the model $\hat f$ so that it is consistent with
the data, i.e., we have $y^{(i)} \approx \hat f(x^{(i)})$, for $i = 1, \ldots, N$. (There is another goal in
choosing $\hat f$, that we will discuss in §13.2.) For data sample $i$, our model predicts
the value $\hat y^{(i)} = \hat f(x^{(i)})$, so the prediction error or residual for this data point is

$$r^{(i)} = y^{(i)} - \hat y^{(i)}.$$

(Some authors define the prediction error in the opposite way, as $\hat y^{(i)} - y^{(i)}$. We
will see that this does not affect the methods developed in this chapter.)


Vector notation for outcomes, predictions, and residuals. For our data set and 
model, we have the observed response y, the prediction gj, and the residual 
or prediction error r“, for each example i = 1,...,.N. We will now use vector 
notation to express these as N-vectors, 


yt = (y®,..., y), gt = (g™,...,9%), ri = (r®,... rO), 


respectively. (In the notation used above to describe the approximate relation 
between the feature vector and the outcome, y ~ f(x), and the prediction function 
9 = f(x), the symbols y and ĝ refer to generic scalar values. With the superscript d 
(for ‘data’), y4, 94, and r€ refer to the N-vectors of observed data values, predicted 
values, and associated residuals.) 

Using this vector notation we can express the (vector of) residuals as
$r^d = y^d - \hat y^d$. A natural measure of how well the model predicts the observed data, or
how consistent it is with the observed data, is the RMS prediction error $\mathbf{rms}(r^d)$.
The ratio $\mathbf{rms}(r^d)/\mathbf{rms}(y^d)$ gives a relative prediction error. For example, if the
relative prediction error is 0.1, we might say that the model predicts the outcomes,
or fits the data, within 10%.
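
A minimal sketch of these two quantities in Python, assuming NumPy is available.
The outcome and prediction vectors below are made-up numbers chosen only for
illustration, and the helper name `rms` is our own.

```python
import numpy as np

def rms(v):
    """Root-mean-square value of a vector."""
    v = np.asarray(v, dtype=float)
    return np.sqrt(np.mean(v ** 2))

# Hypothetical observed outcomes y^d and model predictions yhat^d.
y_d = np.array([10.0, 12.0, 9.0, 11.0])
y_hat_d = np.array([10.5, 11.0, 9.5, 11.5])

r_d = y_d - y_hat_d                     # residual vector r^d = y^d - yhat^d
rms_error = rms(r_d)                    # RMS prediction error
relative_error = rms_error / rms(y_d)   # relative prediction error
print(rms_error, relative_error)
```

A relative error of, say, 0.06 would be read as "the model fits the data within
about 6%".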

Least squares model fitting. A very common method for choosing the model
parameters $\theta_1, \ldots, \theta_p$ is to minimize the RMS prediction error on the given data
set, which is the same as minimizing the sum of squares of the prediction errors,
$\|r^d\|^2$. We now show that this is a least squares problem.

Expressing $\hat y^{(i)} = \hat f(x^{(i)})$ in terms of the model parameters, we have

$$\hat y^{(i)} = A_{i1}\theta_1 + \cdots + A_{ip}\theta_p, \quad i = 1, \ldots, N,$$

where we define the $N \times p$ matrix $A$ as

$$A_{ij} = f_j(x^{(i)}), \quad i = 1, \ldots, N, \quad j = 1, \ldots, p, \tag{13.1}$$

and the $p$-vector $\theta$ as $\theta = (\theta_1, \ldots, \theta_p)$. The $j$th column of $A$ is the $j$th basis
function, evaluated at each of the data points $x^{(1)}, \ldots, x^{(N)}$. Its $i$th row gives the
values of the $p$ basis functions on the $i$th data point $x^{(i)}$. In matrix-vector notation
we have

$$\hat y^d = A\theta.$$

This simple equation shows how our choice of model parameters maps into the
vector of predicted values of the outcomes in the $N$ different experiments. We
know the matrix $A$ from the given data points, and choice of basis functions; our
goal is to choose the $p$-vector of model coefficients $\theta$.
The sum of squares of the residuals is then

$$\|r^d\|^2 = \|y^d - \hat y^d\|^2 = \|y^d - A\theta\|^2 = \|A\theta - y^d\|^2.$$

(In the last step we use the fact that the norm of a vector is the same as the norm
of its negative.) Choosing $\theta$ to minimize this is evidently a least squares problem,
of the same form as (12.1). Provided the columns of $A$ are linearly independent,
we can solve this least squares problem to find $\hat\theta$, the model parameter values that
minimize the norm of the prediction error on our data set, as

$$\hat\theta = (A^TA)^{-1}A^Ty^d = A^\dagger y^d. \tag{13.2}$$

We say that the model parameter values $\hat\theta$ are obtained by least squares fitting on
the data set.
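
The construction of $A$ in (13.1) and the fit (13.2) can be sketched in Python as
follows, assuming NumPy. The basis functions, the synthetic data, and the helper
name `fit_least_squares` are our own choices for illustration; `np.linalg.lstsq`
is used as the solver, and the normal-equations formula is evaluated only as a
cross-check.

```python
import numpy as np

def fit_least_squares(basis, x_data, y_data):
    """Fit f(x) = sum_j theta_j * basis[j](x); return (theta_hat, A)."""
    # A[i, j] = f_j(x^(i)), as in (13.1).
    A = np.array([[f(x) for f in basis] for x in x_data], dtype=float)
    theta_hat, *_ = np.linalg.lstsq(A, y_data, rcond=None)
    return theta_hat, A

# Illustrative basis: constant, linear, and sine terms (an assumption).
basis = [lambda x: 1.0, lambda x: x, lambda x: np.sin(x)]

# Synthetic data generated from known coefficients plus noise.
rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 6.0, 50)
y_data = 2.0 + 0.5 * x_data - 1.5 * np.sin(x_data) + 0.1 * rng.standard_normal(50)

theta_hat, A = fit_least_squares(basis, x_data, y_data)

# Cross-check against (13.2): theta = (A^T A)^{-1} A^T y^d.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y_data)
print(theta_hat, theta_normal)
```

In practice a QR- or SVD-based solver such as `lstsq` is usually preferred over
forming $A^TA$ explicitly; both give the same answer here up to rounding.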

We can interpret each term in $\|y^d - A\theta\|^2$. The term $\hat y^d = A\theta$ is the $N$-vector
of measurements or outcomes that is predicted by our model, with the parameter
vector $\theta$. The term $y^d$ is the $N$-vector of actual observed or measured outcomes.
The difference $y^d - A\theta$ is the $N$-vector of prediction errors. Finally, $\|y^d - A\theta\|^2$ is
the sum of squares of the prediction errors, also called the residual sum of squares
(RSS). This is minimized by the least squares fit $\theta = \hat\theta$.

The number $\|y^d - A\hat\theta\|^2$ is called the minimum sum square error (for the given
model basis and data set). The number $\|y^d - A\hat\theta\|^2/N$ is called the minimum
mean square error (MMSE) (of our model, on the data set). Its square root is
the minimum RMS fitting error. The model performance on the data set can be
visualized by plotting $\hat y^{(i)}$ versus $y^{(i)}$ on a scatter plot, with a dashed line showing
$\hat y = y$ for reference.

Since $\|y^d - A\hat\theta\|^2 = \|A\hat\theta - y^d\|^2$, the same least squares model parameter is
obtained when the residual or prediction error is defined as $\hat y^d - y^d$ instead of (our
definition) $y^d - \hat y^d$. The residual sum of squares, minimum mean square error, and
RMS fitting error also agree using this alternate definition of prediction error.

Some notation differences from chapter 12. Before proceeding we note some
differences in the meanings of symbols used in chapter 12 (on least squares) and in
this chapter on data fitting, that the reader will need to keep in mind. In chapter 12,
the symbol $x$ denotes a generic variable, the vector that we would like to find, and
$b$ refers to the so-called right-hand side, the vector we seek to approximate. In this
chapter, in the context of fitting a model to data, the symbol $x$ generically refers
to a feature vector; we want to find $\theta$, the vector of coefficients in our model, and
the vector we seek to approximate is $y^d$, a vector of (observed) data outcomes.
When we use least squares in this chapter, we will need to transcribe the results
or formulas from chapter 12 to the current context, as in the formula (13.2).


Least squares fit with a constant. We start with the simplest possible fit: We
take $p = 1$, with $f_1(x) = 1$ for all $x$. In this case the model $\hat f$ is a constant function,
with $\hat f(x) = \theta_1$ for all $x$. Least squares fitting in this case is the same as choosing
the best constant value $\theta_1$ to approximate the data $y^{(1)}, \ldots, y^{(N)}$.

In this simple case, the matrix $A$ in (13.1) is the $N \times 1$ matrix $\mathbf{1}$, which always
has linearly independent columns (since it has one column, which is nonzero). The
formula (13.2) is then

$$\hat\theta_1 = (A^TA)^{-1}A^Ty^d = N^{-1}\mathbf{1}^Ty^d = \mathbf{avg}(y^d),$$

where we use $\mathbf{1}^T\mathbf{1} = N$. So the best constant fit to the data is simply its mean,

$$\hat f(x) = \mathbf{avg}(y^d).$$

The RMS fit to the data (i.e., the RMS value of the optimal residual) is

$$\mathbf{rms}(y^d - \mathbf{avg}(y^d)\mathbf{1}) = \mathbf{std}(y^d),$$

the standard deviation of the data. This gives a nice interpretation of the average
value and the standard deviation of the outcomes, as the best constant fit and the
associated RMS error, respectively. It is common to compare the RMS fitting error
for a more sophisticated model with the standard deviation of the outcomes, which
is the optimal RMS fitting error for a constant model.
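
The two identities above are easy to check numerically. The sketch below assumes
NumPy and uses a small made-up outcome vector; note that `np.std` with its default
setting divides by $N$, which matches the definition of standard deviation used
here.

```python
import numpy as np

y_d = np.array([3.0, 5.0, 4.0, 8.0, 10.0])   # made-up outcomes
A = np.ones((len(y_d), 1))                   # the N x 1 matrix of ones

theta_hat, *_ = np.linalg.lstsq(A, y_d, rcond=None)
rms_fit = np.sqrt(np.mean((y_d - A @ theta_hat) ** 2))

print(theta_hat[0], np.mean(y_d))   # best constant equals the mean
print(rms_fit, np.std(y_d))         # its RMS error equals the standard deviation
```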

A simple example of a constant fit is shown in figure 13.1. In this example we
have $n = 1$, so the data points $x^{(i)}$ are scalars. The green circles in the left-hand
plot show the data points; the blue line shows the prediction function $\hat f(x)$ (which
has constant value). The right-hand plot shows a scatter plot of the data outcomes
$y^{(i)}$ versus the predicted values $\hat y^{(i)}$ (all of which are the same), with a dashed line
showing $y = \hat y$.


Independent column assumption. To use least squares fitting we assume that
the columns of the matrix $A$ in (13.1) are linearly independent. We can give
an interesting interpretation of what it means when this assumption fails. If the
columns of $A$ are linearly dependent, it means that one of the columns can be
expressed as a linear combination of the others. Suppose, for example, that the
last column can be expressed as a linear combination of the first $p - 1$ columns.
Using $A_{ij} = f_j(x^{(i)})$, this means

$$f_p(x^{(i)}) = \beta_1 f_1(x^{(i)}) + \cdots + \beta_{p-1} f_{p-1}(x^{(i)}), \quad i = 1, \ldots, N.$$

This says that the value of the $p$th basis function can be expressed as a linear
combination of the values of the first $p - 1$ basis functions on the given data set.
Evidently, then, the $p$th basis function is redundant (on the given data set).

Figure 13.1 The constant fit $\hat f(x) = \mathbf{avg}(y^d)$ to $N = 20$ data points and a
scatter plot of $\hat y^{(i)}$ versus $y^{(i)}$.


13.1.1 Fitting univariate functions


Suppose that $n = 1$, so the feature vector $x$ is a scalar (as is the outcome $y$). The
relationship $y \approx f(x)$ says that $y$ is approximately a (univariate) function $f$ of $x$.
We can plot the data $(x^{(i)}, y^{(i)})$ as points in the $(x, y)$-plane, and we can plot the
model $\hat f$ as a curve in the $(x, y)$-plane. This allows us to visualize the fit of our
model to the data.


Straight-line fit. We take basis functions $f_1(x) = 1$ and $f_2(x) = x$. Our model
has the form

$$\hat f(x) = \theta_1 + \theta_2 x,$$

which is a straight line when plotted. (This is perhaps why $\hat f$ is sometimes called a
linear model, even though it is in general an affine, and not linear, function of $x$.)
Figure 13.2 shows an example.

Figure 13.2 Straight-line fit to 50 points $(x^{(i)}, y^{(i)})$ in a plane.

The matrix $A$ in (13.1) is given by

$$A = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(N)} \end{bmatrix} = \begin{bmatrix} \mathbf{1} & x^d \end{bmatrix},$$

where in the right-hand side we use $x^d$ to denote the $N$-vector of values
$x^d = (x^{(1)}, \ldots, x^{(N)})$. Provided that there are at least two different values appearing in
$x^{(1)}, \ldots, x^{(N)}$, this matrix has linearly independent columns. The parameters in
the optimal straight-line fit to the data are given by

$$\begin{bmatrix} \hat\theta_1 \\ \hat\theta_2 \end{bmatrix} = (A^TA)^{-1}A^Ty^d.$$
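This expression is simple enough for us to work it out explicitly, although there is
no computational advantage to doing so. The Gram matrix is

$$A^TA = \begin{bmatrix} N & \mathbf{1}^Tx^d \\ \mathbf{1}^Tx^d & (x^d)^Tx^d \end{bmatrix}.$$

The 2-vector $A^Ty^d$ is

$$A^Ty^d = \begin{bmatrix} \mathbf{1}^Ty^d \\ (x^d)^Ty^d \end{bmatrix},$$

so we have (using the formula for the inverse of a $2 \times 2$ matrix)

$$\begin{bmatrix} \hat\theta_1 \\ \hat\theta_2 \end{bmatrix} = \frac{1}{N(x^d)^Tx^d - (\mathbf{1}^Tx^d)^2} \begin{bmatrix} (x^d)^Tx^d & -\mathbf{1}^Tx^d \\ -\mathbf{1}^Tx^d & N \end{bmatrix} \begin{bmatrix} \mathbf{1}^Ty^d \\ (x^d)^Ty^d \end{bmatrix}.$$

Multiplying the scalar term by $N^2$, and dividing the matrix and vector terms by
$N$, we can express this as

$$\begin{bmatrix} \hat\theta_1 \\ \hat\theta_2 \end{bmatrix} = \frac{1}{\mathbf{rms}(x^d)^2 - \mathbf{avg}(x^d)^2} \begin{bmatrix} \mathbf{rms}(x^d)^2 & -\mathbf{avg}(x^d) \\ -\mathbf{avg}(x^d) & 1 \end{bmatrix} \begin{bmatrix} \mathbf{avg}(y^d) \\ (x^d)^Ty^d/N \end{bmatrix}.$$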

The optimal slope $\hat\theta_2$ of the straight-line fit can be expressed more simply in terms of
the correlation coefficient $\rho$ between the data vectors $x^d$ and $y^d$, and their standard
deviations. We have

$$\begin{aligned}
\hat\theta_2 &= \frac{N(x^d)^Ty^d - (\mathbf{1}^Tx^d)(\mathbf{1}^Ty^d)}{N(x^d)^Tx^d - (\mathbf{1}^Tx^d)^2} \\
&= \frac{(x^d - \mathbf{avg}(x^d)\mathbf{1})^T(y^d - \mathbf{avg}(y^d)\mathbf{1})}{\|x^d - \mathbf{avg}(x^d)\mathbf{1}\|^2} \\
&= \rho\,\frac{\mathbf{std}(y^d)}{\mathbf{std}(x^d)}.
\end{aligned}$$

In the last step we used the definitions

$$\rho = \frac{(x^d - \mathbf{avg}(x^d)\mathbf{1})^T(y^d - \mathbf{avg}(y^d)\mathbf{1})}{N\,\mathbf{std}(x^d)\,\mathbf{std}(y^d)}, \qquad \mathbf{std}(x^d) = \frac{\|x^d - \mathbf{avg}(x^d)\mathbf{1}\|}{\sqrt{N}}$$

from chapter 3. From the first of the two normal equations,
$N\hat\theta_1 + (\mathbf{1}^Tx^d)\hat\theta_2 = \mathbf{1}^Ty^d$,
we also obtain a simple expression for $\hat\theta_1$:

$$\hat\theta_1 = \mathbf{avg}(y^d) - \hat\theta_2\,\mathbf{avg}(x^d).$$

Putting these results together, we can write the least squares fit as

$$\hat f(x) = \mathbf{avg}(y^d) + \rho\,\frac{\mathbf{std}(y^d)}{\mathbf{std}(x^d)}\,(x - \mathbf{avg}(x^d)). \tag{13.3}$$

(Note that $x$ and $y$ are generic scalar values, while $x^d$ and $y^d$ are vectors of the
observed data values.) When $\mathbf{std}(y^d) \neq 0$, this can be expressed in the more
symmetric form

$$\frac{\hat y - \mathbf{avg}(y^d)}{\mathbf{std}(y^d)} = \rho\,\frac{x - \mathbf{avg}(x^d)}{\mathbf{std}(x^d)},$$

which has a nice interpretation. The left-hand side is the difference between the
predicted response value and the mean response value, divided by its standard
deviation. The right-hand side is the correlation coefficient $\rho$ times the same quantity,
computed for the independent variable $x$.
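
As a numerical check of (13.3), the following Python sketch (assuming NumPy, with
synthetic data invented for this purpose) computes the slope and intercept from
the correlation coefficient and standard deviations, and compares them with a
generic least squares solve on $A = [\mathbf{1} \; x^d]$.

```python
import numpy as np

# Synthetic scalar data: a noisy straight line (illustrative only).
rng = np.random.default_rng(1)
x_d = rng.uniform(0.0, 10.0, size=50)
y_d = 1.0 + 2.0 * x_d + rng.standard_normal(50)

# Closed form from (13.3): slope = rho * std(y) / std(x).
rho = np.corrcoef(x_d, y_d)[0, 1]
theta2 = rho * np.std(y_d) / np.std(x_d)
theta1 = np.mean(y_d) - theta2 * np.mean(x_d)

# Generic least squares with A = [1  x^d].
A = np.column_stack([np.ones_like(x_d), x_d])
theta_ls, *_ = np.linalg.lstsq(A, y_d, rcond=None)

print(theta1, theta2)
print(theta_ls)
```

The two parameter pairs agree up to rounding error, as the derivation predicts.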
The least squares straight-line fit is used in many application areas. 


Asset $\alpha$ and $\beta$ in finance. In finance, the straight-line fit is used to predict the
return of an individual asset from the return of the whole market. (The return of
the whole market is typically taken to be a sum of the individual asset returns,
weighted by their capitalizations.) The straight-line model $\hat f(x) = \theta_1 + \theta_2 x$ predicts
the asset return from the market return $x$. The least squares straight-line fit is
computed from observed market returns $r^{\mathrm{mkt}}_1, \ldots, r^{\mathrm{mkt}}_T$ and individual asset returns
$r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_T$ over some period of length $T$. We therefore take

$$x^d = (r^{\mathrm{mkt}}_1, \ldots, r^{\mathrm{mkt}}_T), \qquad y^d = (r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_T)$$

in (13.3). The model is typically written in the form

$$\hat f(x) = (r^{\mathrm{rf}} + \alpha) + \beta(x - \mu^{\mathrm{mkt}}),$$

where $r^{\mathrm{rf}}$ is the risk-free interest rate over the period and $\mu^{\mathrm{mkt}} = \mathbf{avg}(x^d)$ is
the average market return. Comparing this formula to the straight-line model
$\hat f(x) = \theta_1 + \theta_2 x$, we find that $\theta_2 = \beta$, and $\theta_1 = r^{\mathrm{rf}} + \alpha - \beta\mu^{\mathrm{mkt}}$.

The prediction of asset return $\hat f(x)$ has two components: A constant $r^{\mathrm{rf}} + \alpha$, and
one that is proportional to the de-meaned market performance, $\beta(x - \mu^{\mathrm{mkt}})$. The
second component, which has average value zero, relates market return fluctuations
to the asset return fluctuations, and is related to the correlation of the asset and
market returns; see exercise 13.4. The parameter $\alpha$ is the average asset return,
over and above the risk-free interest rate. This model of asset return in terms of
the market return is so common that the terms ‘Alpha’ and ‘Beta’ are widely used
in finance. (Though not always with exactly the same meaning, since there are a
few variations on how the parameters are defined.)
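
A sketch of how $\alpha$ and $\beta$ could be recovered from a straight-line fit, in
Python with NumPy. The return series and the risk-free rate below are synthetic
values invented for illustration, not market data, and the variable names are our
own.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 250
r_mkt = 0.0004 + 0.01 * rng.standard_normal(T)                   # synthetic market returns
r_ind = 0.0002 + 1.3 * r_mkt + 0.005 * rng.standard_normal(T)    # synthetic asset returns
r_rf = 0.0001                                                    # assumed risk-free rate

# Straight-line fit f(x) = theta1 + theta2 * x of asset return on market return.
A = np.column_stack([np.ones(T), r_mkt])
theta, *_ = np.linalg.lstsq(A, r_ind, rcond=None)
theta1, theta2 = theta

# Translate using theta2 = beta and theta1 = r_rf + alpha - beta * mu_mkt.
mu_mkt = np.mean(r_mkt)
beta = theta2
alpha = theta1 - r_rf + beta * mu_mkt
print(alpha, beta)
```

Because $\hat\theta_1 = \mathbf{avg}(y^d) - \hat\theta_2\,\mathbf{avg}(x^d)$, the value of
$\alpha$ computed this way equals the average asset return minus the risk-free rate.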


Time series trend. Suppose the data represents a series of samples of a quantity
$y$ at time (epoch) $x^{(i)} = i$. The straight-line fit to the time series data,

$$\hat y^{(i)} = \theta_1 + \theta_2 i, \quad i = 1, \ldots, N,$$

is called the trend line. Its slope, which is $\theta_2$, is interpreted as the trend in the
quantity over time. Subtracting the trend line from the original time series we
get the de-trended time series, $y^d - \hat y^d$. The de-trended time series shows how the
time series compares with its straight-line fit: When it is positive, it means the
time series is above its straight-line fit, and when it is negative, it is below the
straight-line fit.
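
A short Python sketch of the trend line and the de-trended series, assuming NumPy.
The series is synthetic (a made-up linear trend plus noise), not the petroleum
data discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 34
t = np.arange(1, N + 1)                          # epochs x^(i) = i
y_d = 60.0 + 0.8 * t + rng.standard_normal(N)    # synthetic time series

A = np.column_stack([np.ones(N), t])             # basis functions 1 and x
theta_hat, *_ = np.linalg.lstsq(A, y_d, rcond=None)

trend_line = A @ theta_hat                       # yhat^d
detrended = y_d - trend_line                     # y^d - yhat^d
print("trend (slope):", theta_hat[1])
```

The de-trended series is the least squares residual, so it is orthogonal to both
columns of $A$: it sums to zero and has zero inner product with $(1, \ldots, N)$.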

An example is shown in figures 13.3 and 13.4. Figure 13.3 shows world petroleum 
consumption versus year, along with the straight-line fit. Figure 13.4 shows the 
de-trended world petroleum consumption. 


Figure 13.3 World petroleum consumption between 1980 and 2013 (dots)
and least squares straight-line fit (data from www.eia.gov).

Figure 13.4 De-trended world petroleum consumption.


Estimation of trend and seasonal component. In the previous example, we used
least squares to approximate a time series $y^d = (y^{(1)}, \ldots, y^{(N)})$ of length $N$ by a
sum of two components: $y^d \approx \hat y^d = \hat y^{\mathrm{const}} + \hat y^{\mathrm{lin}}$, where

$$\hat y^{\mathrm{const}} = \theta_1 \mathbf{1}, \qquad \hat y^{\mathrm{lin}} = \theta_2\,(1, 2, \ldots, N).$$

Figure 13.5 Top. Vehicle miles traveled in the US, per month, in the period
January 2000 to December 2014 (U.S. Department of Transportation, Bureau
of Transportation Statistics, www.transtats.bts.gov). Bottom. Least squares
fit of a sum of two time series: A linear trend and a seasonal component
with a 12-month period.

In many applications, the de-trended time series has a clear periodic component,
i.e., a component that repeats itself periodically. As an example, figure 13.5 shows
an estimate of the road traffic (total number of miles traveled in vehicles) in the
US, for each month between January 2000 and December 2014. The most striking
aspect of the time series is the pattern that is (approximately) repeated every year,
with a peak in the summer and a minimum in the winter. In addition there is a
slowly increasing long term trend. The bottom figure shows the least squares fit of
a sum of two components

$$\hat y^d = \hat y^{\mathrm{lin}} + \hat y^{\mathrm{seas}},$$

where $\hat y^{\mathrm{lin}}$ and $\hat y^{\mathrm{seas}}$ are defined as

$$\hat y^{\mathrm{lin}} = \theta_1 \begin{bmatrix} 1 \\ 2 \\ \vdots \\ N \end{bmatrix}, \qquad \hat y^{\mathrm{seas}} = \begin{bmatrix} \theta_{2:(P+1)} \\ \theta_{2:(P+1)} \\ \vdots \\ \theta_{2:(P+1)} \end{bmatrix}.$$


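The second component is periodic or seasonal, with period $P = 12$, and consists of
the pattern $(\theta_2, \ldots, \theta_{P+1})$, repeated $N/P$ times (we assume $N$ is a multiple of $P$).
The constant term is omitted in the model because it would be redundant: It has
the same effect as adding a constant to the parameters $\theta_2, \ldots, \theta_{P+1}$.

The least squares fit is computed by minimizing $\|A\theta - y^d\|^2$ where $\theta$ is a
$(P+1)$-vector and the matrix $A$ in (13.1) is given by

$$A = \begin{bmatrix}
1 & 1 & 0 & \cdots & 0 \\
2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
P & 0 & 0 & \cdots & 1 \\
P+1 & 1 & 0 & \cdots & 0 \\
P+2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
2P & 0 & 0 & \cdots & 1 \\
\vdots & \vdots & \vdots & & \vdots \\
N-P+1 & 1 & 0 & \cdots & 0 \\
N-P+2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
N & 0 & 0 & \cdots & 1
\end{bmatrix}.$$

In this example, $N = 15P = 180$. The residual or prediction error in this case is
called the de-trended, seasonally-adjusted series.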
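
The matrix above is a column $(1, \ldots, N)$ next to $N/P$ stacked copies of the
$P \times P$ identity. A Python sketch of this construction and the resulting fit,
assuming NumPy; the series is synthetic (an invented trend and seasonal pattern
plus noise), not the vehicle-miles data.

```python
import numpy as np

P, cycles = 12, 15
N = P * cycles                                   # N = 15 P = 180
t = np.arange(1, N + 1)

# A = [ (1,...,N) | I_P stacked N/P times ], an N x (P+1) matrix.
A = np.column_stack([t, np.tile(np.eye(P), (cycles, 1))])

# Synthetic series: linear trend + repeating 12-sample pattern + noise.
rng = np.random.default_rng(4)
pattern = 10.0 + 3.0 * np.sin(2 * np.pi * np.arange(P) / P)
y_d = 0.05 * t + np.tile(pattern, cycles) + 0.2 * rng.standard_normal(N)

theta_hat, *_ = np.linalg.lstsq(A, y_d, rcond=None)
y_lin = theta_hat[0] * t                         # linear trend component
y_seas = np.tile(theta_hat[1:], cycles)          # seasonal component
adjusted = y_d - (y_lin + y_seas)                # de-trended, seasonally-adjusted series
print(theta_hat[0], np.sqrt(np.mean(adjusted ** 2)))
```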


Polynomial fit. A simple extension beyond the straight-line fit is a polynomial fit,
with

$$f_i(x) = x^{i-1}, \quad i = 1, \ldots, p,$$

so $\hat f$ is a polynomial of degree at most $p - 1$,

$$\hat f(x) = \theta_1 + \theta_2 x + \cdots + \theta_p x^{p-1}.$$

(Note that here, $x^i$ means the generic scalar value $x$ raised to the $i$th power; $x^{(i)}$
means the $i$th observed scalar data value.) In this case the matrix $A$ in (13.1) has
the form

$$A = \begin{bmatrix}
1 & x^{(1)} & \cdots & (x^{(1)})^{p-1} \\
1 & x^{(2)} & \cdots & (x^{(2)})^{p-1} \\
\vdots & \vdots & & \vdots \\
1 & x^{(N)} & \cdots & (x^{(N)})^{p-1}
\end{bmatrix},$$

i.e., it is a Vandermonde matrix (see (6.7)). Its columns are linearly independent
provided the numbers $x^{(1)}, \ldots, x^{(N)}$ include at least $p$ different values. Figure 13.6
shows an example of the least squares fit of polynomials of degree 2, 6, 10, and 15
to a set of 100 data points. Since any polynomial of degree less than $r$ is also a
polynomial of degree less than $s$, for $r < s$, it follows that the RMS fit attained
by a polynomial with a larger degree is smaller (or at least, no larger) than that
obtained by a fit with a smaller degree polynomial. This suggests that we should
use the largest degree polynomial that we can, since this results in the smallest
residual and the best RMS fit. But we will see in §13.2 that this is not true, and
explore rational methods for choosing a model from among several candidates.

Figure 13.6 Least squares polynomial fits of degree 2, 6, 10, and 15 to 100
points.
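
The observation that the training RMS error cannot increase with the degree can
be checked with a few lines of Python, assuming NumPy. The data are synthetic
(a made-up smooth function plus noise), and `poly_fit_rms` is a helper name of
our own; `np.vander` with `increasing=True` builds the Vandermonde matrix with
columns $1, x, \ldots, x^{p-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
x_d = rng.uniform(-1.0, 1.0, size=100)
y_d = np.sin(3 * x_d) + 0.1 * rng.standard_normal(100)   # synthetic data

def poly_fit_rms(x, y, degree):
    """Least squares polynomial fit; return (theta, RMS fitting error)."""
    A = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, np.sqrt(np.mean((A @ theta - y) ** 2))

rms_by_degree = [poly_fit_rms(x_d, y_d, d)[1] for d in range(0, 11)]
for d, e in enumerate(rms_by_degree):
    print(d, round(e, 4))
```

The printed errors form a nonincreasing sequence, which by itself says nothing
about how the fits behave on new data.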


Piecewise-linear fit. A piecewise-linear function, with knot points or kink points
$a_1 < a_2 < \cdots < a_k$, is a continuous function that is affine in between the knot
points. (Such functions should be called piecewise-affine.) We can describe any
piecewise-linear function with $k$ knot points using the $p = k + 2$ basis functions

$$f_1(x) = 1, \quad f_2(x) = x, \quad f_{i+2}(x) = (x - a_i)_+, \quad i = 1, \ldots, k,$$

where $(u)_+ = \max\{u, 0\}$. These basis functions are shown in figure 13.7 for $k = 2$
knot points at $a_1 = -1$, $a_2 = 1$. An example of a piecewise-linear fit with these
knot points is shown in figure 13.8.

Figure 13.7 The piecewise-linear functions $(x + 1)_+ = \max\{x + 1, 0\}$ and
$(x - 1)_+ = \max\{x - 1, 0\}$.
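
A Python sketch of a piecewise-linear fit with knot points at $-1$ and $1$, assuming
NumPy. The data are generated from an invented piecewise-linear function plus
noise, and `pwl_basis_matrix` is a helper name of our own.

```python
import numpy as np

def pwl_basis_matrix(x, knots):
    """Columns: 1, x, (x - a_1)_+, ..., (x - a_k)_+."""
    cols = [np.ones_like(x), x] + [np.maximum(x - a, 0.0) for a in knots]
    return np.column_stack(cols)

knots = [-1.0, 1.0]
rng = np.random.default_rng(6)
x_d = rng.uniform(-3.0, 3.0, size=100)

# Made-up "true" coefficients used only to generate synthetic outcomes.
theta_true = np.array([1.0, 0.5, -1.0, 2.0])
A = pwl_basis_matrix(x_d, knots)
y_d = A @ theta_true + 0.1 * rng.standard_normal(100)

theta_hat, *_ = np.linalg.lstsq(A, y_d, rcond=None)
print(theta_hat)
```

The coefficient of $(x - a_i)_+$ is the change in slope at knot $a_i$, so the fitted
function has slope $\hat\theta_2$ to the left of the first knot.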


Regression 


We now return to the general case when $x$ is an $n$-vector. Recall that the regression
model has the form

$$\hat y = x^T\beta + v,$$

where $\beta$ is the weight vector and $v$ is the offset. We can put this model in our
general data fitting form using the basis functions $f_1(x) = 1$, and

$$f_i(x) = x_{i-1}, \quad i = 2, \ldots, n+1,$$

so $p = n + 1$. The regression model can then be expressed as

$$\hat y = x^T\theta_{2:n+1} + \theta_1,$$

and we see that $\beta = \theta_{2:n+1}$ and $v = \theta_1$.

The $N \times (n+1)$ matrix $A$ in our general data fitting form is given by

$$A = \begin{bmatrix} \mathbf{1} & X^T \end{bmatrix},$$

where $X$ is the feature matrix with columns $x^{(1)}, \ldots, x^{(N)}$. So the regression model
is a special case of our general linear in the parameters model.

Figure 13.8 Piecewise-linear fit to 100 points.


General fitting model as regression. The regression model is a special case of
our general data fitting model. Conversely, we can think of our linear in the
parameters model as regression, with a different set of feature vectors of dimension
$p - 1$. Assuming the first basis element $f_1$ is the constant function with value one,
we consider the new or generated feature vectors $\tilde x$ given by

$$\tilde x = \begin{bmatrix} f_2(x) \\ \vdots \\ f_p(x) \end{bmatrix}.$$

A regression model for the outcome $y$ and the new generated or mapped features
$\tilde x$ has the form

$$\hat y = \tilde x^T\beta + v,$$

where $\beta$ has dimension $p - 1$, and $v$ is a number. Comparing this to our linear in
the parameters model

$$\hat y = \theta_1 f_1(x) + \cdots + \theta_p f_p(x),$$

we see that they are the same, with $v = \theta_1$, $\beta = \theta_{2:p}$. So we can think of the general
linear in the parameters model as nothing more than simple regression, but applied
to the transformed, mapped, or generated features $f_1(x), \ldots, f_p(x)$. (This idea is
discussed more in §13.3.)


House price regression. In §2.3 we described a simple regression model for the
selling price of a house based on two attributes, area and number of bedrooms.
The values of the parameters $\beta$ and the offset $v$ given in (2.9) were computed by
least squares fitting on a set of data consisting of 774 house sales in Sacramento
over a 5-day period. The RMS fitting error for the model is 74.8 (in thousands of
dollars). For comparison, the standard deviation of the prices in the data set is
112.8. So this very basic regression model predicts the prices substantially better
than a constant model (i.e., the mean price of the houses in the data set).
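
The same comparison can be sketched in Python, assuming NumPy. The data below are
synthetic and are not the Sacramento house sales; the feature ranges, weights,
offset, and noise level are invented purely to show the mechanics of fitting
$A = [\mathbf{1} \; X^T]$ and comparing the RMS fitting error with the standard
deviation of the outcomes.

```python
import numpy as np

rng = np.random.default_rng(7)
n, N = 2, 200
X = rng.uniform(0.0, 5.0, size=(n, N))           # feature matrix; columns are x^(i)
beta_true = np.array([150.0, -20.0])             # made-up weights
v_true = 50.0                                    # made-up offset
y_d = X.T @ beta_true + v_true + 30.0 * rng.standard_normal(N)

A = np.column_stack([np.ones(N), X.T])           # A = [1  X^T]
theta_hat, *_ = np.linalg.lstsq(A, y_d, rcond=None)
v_hat, beta_hat = theta_hat[0], theta_hat[1:]    # v = theta_1, beta = theta_{2:n+1}

rms_fit = np.sqrt(np.mean((y_d - A @ theta_hat) ** 2))
rms_const = np.std(y_d)                          # RMS error of the best constant model
print(rms_fit, rms_const)
```

Since the constant model is contained in the regression model, the regression RMS
fitting error on the data set can never exceed the standard deviation of the outcomes.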


Auto-regressive time series model. Suppose that $z_1, z_2, \ldots$ is a time series. An
auto-regressive model (also called AR model) for the time series has the form

$$\hat z_{t+1} = \theta_1 z_t + \cdots + \theta_M z_{t-M+1}, \quad t = M, M+1, \ldots$$

where $M$ is the memory or lag of the model. Here $\hat z_{t+1}$ is the prediction of $z_{t+1}$
made at time $t$ (when $z_t, \ldots, z_{t-M+1}$ are known). This prediction is a linear function
of the previous $M$ values of the time series. With good choice of model
parameters, the AR model can be used to predict the next value in a time series,
given the current value and the $M - 1$ values before it. This has many practical uses.

We can use least squares (or regression) to choose the AR model parameters,
based on the observed data $z_1, \ldots, z_T$, by minimizing the sum of squares of the
prediction errors $z_t - \hat z_t$ over $t = M + 1, \ldots, T$, i.e.,

$$(z_{M+1} - \hat z_{M+1})^2 + \cdots + (z_T - \hat z_T)^2.$$

(We must start the predictions at $t = M + 1$, since each prediction involves the
previous $M$ time series values, and we do not know $z_0, z_{-1}, \ldots$.)

The AR model can be put into the general linear in the parameters model form
by taking

$$y^{(i)} = z_{M+i}, \qquad x^{(i)} = (z_{M+i-1}, \ldots, z_i), \quad i = 1, \ldots, T - M.$$

We have $N = T - M$ examples, and $n = M$ features.

As an example, consider the time series of hourly temperature at Los Angeles
International Airport, May 1-31, 2016, with length $31 \cdot 24 = 744$. The simple
constant prediction $\hat z_{t+1} = 61.76$°F (the average temperature) has RMS prediction
error 3.05°F (the standard deviation). The very simple predictor $\hat z_{t+1} = z_t$, i.e.,
guessing that the temperature next hour is the same as the current temperature, has
RMS error 1.16°F. The predictor $\hat z_{t+1} = z_{t-23}$, i.e., guessing that the temperature
next hour is what it was yesterday at the same time, has RMS error 1.73°F.

We fit an AR model with memory $M = 8$ using least squares, with
$N = 31 \cdot 24 - 8 = 736$ samples. The RMS error of this predictor is 0.98°F, smaller than
the RMS errors for the simple predictors described above. Figure 13.9 shows the
temperature and the predictions for the first five days.
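
The mapping from a time series to the data fitting form can be sketched in Python
as follows, assuming NumPy. The series is a synthetic daily cycle plus noise, not
the airport temperature data, so the error values it produces are not those quoted
above; `fit_ar` is a helper name of our own.

```python
import numpy as np

def fit_ar(z, M):
    """Least squares AR fit with memory M; returns (theta, A, y)."""
    T = len(z)
    # Row i (0-based) of A holds (z[M+i-1], ..., z[i]), most recent value first,
    # and y[i] = z[M+i], matching y^(i) = z_{M+i}, x^(i) = (z_{M+i-1}, ..., z_i).
    A = np.column_stack([z[M - 1 - j : T - 1 - j] for j in range(M)])
    y = z[M:]
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, A, y

# Synthetic hourly series with a 24-hour cycle (illustrative only).
rng = np.random.default_rng(8)
t = np.arange(31 * 24)
z = 62.0 + 3.0 * np.sin(2 * np.pi * t / 24) + 0.3 * rng.standard_normal(t.size)

M = 8
theta, A, y = fit_ar(z, M)
rms_ar = np.sqrt(np.mean((y - A @ theta) ** 2))

# Baseline on the same samples: predict the next value by the current one.
rms_persist = np.sqrt(np.mean((z[M:] - z[M - 1 : -1]) ** 2))
print(rms_ar, rms_persist)
```

The predictor $\hat z_{t+1} = z_t$ is the AR model with $\theta = (1, 0, \ldots, 0)$, so on
the samples used for fitting the least squares AR model can do no worse than it.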


Figure 13.9 Hourly temperature at Los Angeles International Airport between
12:53AM on May 1, 2016, and 11:53PM on May 5, 2016, shown as
circles. The solid line is the prediction of an auto-regressive model with eight
coefficients.


Log transform of dependent variable


When the dependent variable $y$ is positive and varies over a large range, it is
common to replace it with its logarithm $w = \log y$, and then use least squares to
develop a model for $w$, $\hat w = \hat g(x)$. We then form our estimate of $y$ using $\hat y = e^{\hat g(x)}$.
When we fit a model $\hat w = \hat g(x)$ to the logarithm $w = \log y$, the fitting error for $w$
can be interpreted in terms of the percentage or relative error between $\hat y$ and $y$,
defined as

$$\eta = \max\{\hat y/y,\; y/\hat y\} - 1.$$

So $\eta = 0.1$ means either $\hat y = 1.1y$ (i.e., we over-estimate by 10%) or $\hat y = (1/1.1)y$
(i.e., we under-estimate by 10%). The connection between the relative error between
$\hat y$ and $y$, and the residual $r$ in predicting $w$, is

$$\eta = e^{|r|} - 1$$

(see exercise 13.16). For example, a residual with $|r| = 0.05$ corresponds (approximately)
to a relative error in our prediction $\hat y$ of $y$ of 5%. (Here we use the
approximation $e^{|r|} - 1 \approx |r|$ for $|r|$ smaller than, say, 0.15.) So if our RMS error in
predicting $w = \log y$ across our examples is 0.05, our predictions of $y$ are typically
within ±5%.

As an example suppose we wish to model house sale prices over an area and 
period that includes sale prices varying from under $100k to over $10M. Fitting the 
prices directly means that we care equally about absolute errors in price predictions, 
so, for example, a prediction error of $20k should bother us equally for a house that 
sells for $70k and one that sells for $6.5M. But (at least for some purposes) in the 
first case we have made a very poor estimate, and the second case, a remarkably 
good one. When we fit the logarithm of the house sale price, we are seeking low 
percentage prediction errors, not necessarily low absolute errors. 


Whether or not to use a logarithmic transform on the dependent variable (when 
it is positive) is a judgment call that depends on whether we seek small absolute 
prediction errors or small relative or percentage prediction errors. 
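
The relation between the log-domain residual and the relative error can be
illustrated with a short Python sketch, assuming NumPy. The prices and the
log-domain predictions are made-up values, and `relative_error` is a helper name
of our own.

```python
import numpy as np

def relative_error(y_hat, y):
    """eta = max(y_hat / y, y / y_hat) - 1, for positive y_hat and y."""
    return np.maximum(y_hat / y, y / y_hat) - 1.0

y = np.array([70e3, 250e3, 6.5e6])                    # made-up sale prices
w_hat = np.log(y) + np.array([0.05, -0.05, 0.10])     # assumed predictions of log y
y_hat = np.exp(w_hat)                                 # estimate of y

r = np.log(y) - w_hat                                 # residual in predicting w
eta = relative_error(y_hat, y)
print(eta)                                            # relative errors
print(np.exp(np.abs(r)) - 1.0)                        # same values, via e^{|r|} - 1
```

A log-domain residual of magnitude 0.05 gives a relative error of about 5%
regardless of whether the house sells for \$70k or \$6.5M.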


Validation 


Generalization ability. In this section we address a key point in model fitting: 
The goal of model fitting is typically not to just achieve a good fit on the given 
data set, but rather to achieve a good fit on new data that we have not yet seen. 
This leads us to a basic question: How well can we expect a model to predict y for
future or other unknown values of x? Without some assumptions about the future 
data, there is no good way to answer this question. 

One very common assumption is that the data are described by a formal prob- 
ability model. With this assumption, techniques from probability and statistics 
can be used to predict how well a model will work on new, unseen data. This 
approach has been very successful in many applications, and we hope that you will 
learn about these methods in another course. In this book, however, we will take 
a simple intuitive approach to this issue. 

If a model predicts the outcomes for new unseen data values as well, or nearly 
as well, as it predicts the outcomes on the data used to form the model, it is said 
to have good generalization ability. In the opposite case, when the model makes 
predictions on new unseen data that are much worse than the predictions on the 
data used to form the model, the model is said to have poor generalization ability. 
So our question is: How can we assess the generalization ability of a model? 


Validation on a test set. A simple but effective method for assessing the gener- 
alization ability of a model is called out-of-sample validation. We divide the data 
we have into two sets: A training set and a test set (also called a validation set). 
This is often done randomly, with 80% of the data put into the training set and 
20% put into the test set. A common way to describe this is to say that ‘20% of 
the data were reserved for validation’. Another common choice for the split ratio 
between the training set and the test set is 90%-10%. 

To fit our model, we use only the data in the training set. The model that we 
come up with is based only on the data in the training set; the data in the test set 
has never been ‘seen’ by the model. Then we judge the model by its RMS fit on the 
test set. Since the model was developed without any knowledge of the test set data, 
the test data are effectively data that are new and unseen, and the performance of 
our model on this data gives us at least an idea of how our model will perform on 
new, unseen data. If the RMS prediction error on the test set is much larger than 
the RMS prediction error on the training set, we conclude that our model has poor 
generalization ability. Assuming that the test data are ‘typical’ of future data, the 
RMS prediction error on the test set is what we might guess our RMS prediction 
error will be on new data. 

If the RMS prediction error of the model on the training set is similar to the 
RMS prediction error on the test set, we have increased confidence that our model 
has reasonable generalization ability. (A more sophisticated validation method 
called cross-validation, described below, can be used to gain even more confidence.) 

For example, if our model achieves an RMS prediction error of 10% (compared 
to rms(y)) on the training set and 11% on the test set, we can guess that it will have 
a similar RMS prediction error on other unseen data. But there is no guarantee 
of this, without further assumptions about the data. The basic assumption we are 
making here is that the future data will ‘look like’ the test data, or that the test 
data were ‘typical’. Ideas from statistics can make this idea more precise, but we 
will leave this idea informal and intuitive. 
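
A minimal Python sketch of out-of-sample validation with a random 80%-20% split,
assuming NumPy. The data are synthetic (an invented straight-line relation plus
noise); the point is only the mechanics of fitting on the training set and
evaluating on the held-out test set.

```python
import numpy as np

def rms(v):
    return np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(9)
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(N)   # synthetic data
A = np.column_stack([np.ones(N), x])

# Random split: 80% training, 20% reserved for validation.
perm = rng.permutation(N)
n_train = int(0.8 * N)
train, test = perm[:n_train], perm[n_train:]

# Fit using the training set only.
theta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

rms_train = rms(A[train] @ theta - y[train])
rms_test = rms(A[test] @ theta - y[test])
print(rms_train, rms_test)
```

Similar training and test errors, as one would expect here because the model form
matches how the data were generated, are a sign of reasonable generalization ability.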


Over-fitting. When the RMS prediction error on the training set is much smaller
than the RMS prediction error on the test set, we say that the model is over-fit. 
It tells us that, for the purposes of making predictions on new, unseen data, the 
model is much less valuable than its performance on the training data suggests. 
Roughly speaking, an over-fit model trusts the data it has seen (i.e., the training 
set) too much; it is too sensitive to the changes in the data that will likely be seen 
in the future data. One method for avoiding over-fit is to keep the model simple; 
another technique, called regularization, is discussed in chapter 15. Over-fit can be 
detected and (one hopes) avoided by validating a model on a test set. 


Model prediction quality and generalization ability. Model generalization ability 
and training set prediction quality are not the same. A model can perform poorly 
and yet have good generalization ability. As an example, consider the (very simple) 
model that always makes the prediction $\hat y = 0$. This model will (likely) perform
poorly on the training set and the test set data, with similar RMS errors, assuming 
the two data sets are ‘similar’. So this model has good generalization ability, but 
has poor prediction quality. In general, we seek a model that makes good predictions 
on the training data set and also makes good predictions on the test data set. In 
other words, we seek a model with good performance and generalization ability. 
We care much more about a model’s performance on the test data set than the 
training data set, since its performance on the test data set is much more likely to 
predict how the model will do on (other) unseen data. 


Choosing among different models. We can use least squares fitting to fit multiple 
models to the same data. For example, in univariate fitting, we can fit a constant, 
an affine function, a quadratic, or a higher order polynomial. Which is the best 
model among these? Assuming that the goal is to make good predictions on new, 
unseen data, we should choose the model with the smallest RMS prediction error on 
the test set. Since the RMS prediction error on the test set is only a guess about
what we might expect for performance on new, unseen data, we can soften this
advice to: we should choose a model whose test set RMS error is near the
minimum over the candidates. If multiple candidates achieve test set performance
near the minimum, we should choose the ‘simplest’ one among these candidates.

We observed earlier that when we add basis functions to a model, our fitting 
error on the training data can only decrease (or stay the same). But this is not 
true for the test error. The test error need not decrease when we add more basis 
functions. Indeed, when we have too many basis functions, we can expect over-fit, 
i.e., larger error on the test set. 

If we have a sequence of basis functions $f_1, f_2, \ldots$ we can consider models based
on using just $f_1$ (which is often the constant function 1), then $f_1$ and $f_2$, and so on.
As we increase $p$, the number of basis functions, our training error will go down
(or stay the same). But the test error typically decreases at first and then starts to
increase for larger $p$. The intuition for this typical behavior is that for $p$ too small,
our model is ‘too simple’ to fit the data well, and so cannot make good predictions;
when $p$ is too large, our model is ‘too complex’ and suffers from over-fit, and so
makes poor predictions. Somewhere in the middle, where the model achieves
near-minimum test set error, is a good choice (or several good choices) of $p$.

Figure 13.10 The polynomial fits of figure 13.6 evaluated on a test set of 100
points.


Example. To illustrate these ideas, we consider the example shown in figure 13.6. 
Using a training set of 100 points, we find least squares fits of polynomials of degrees 
0,1,...,20. (The polynomial fits of degrees 2, 6, 10, and 15 are shown in the figure.) 
We now obtain a new set of data for validation, also with 100 points. These test 
data are plotted along with the polynomial fits obtained from the training data 
in figure 13.10. This is a real check of our models, since these data points were 
not used to develop the models. Figure 13.11 shows the RMS training and test 
errors for polynomial fits of different degrees. We can see that the RMS training 
error decreases with every increase in degree. The RMS test error decreases until 
degree 6 and starts to increase for degrees larger than 6. This plot suggests that a 
polynomial fit of degree 6 is a reasonable choice. (Degree 4 is another reasonable 
choice, considering that it achieves nearly minimum test error, and is ‘simpler’ than 
the model with degree 6.) Note also that the models with degrees 0, 1, and 2 have 
good generalization ability (i.e., similar performance on the training and test sets), 
but worse prediction performance than models with higher degrees. 

Figure 13.11 RMS error versus polynomial degree for the fitting example 
in figures 13.6 and 13.10. Circles indicate RMS errors on the training set. 
Squares show RMS errors on the test set. 


With a 6th degree polynomial, the relative RMS error for both the training 
and test sets is around 0.3. It is a good sign, in terms of generalization ability, 
that the training and test errors are similar. While there are no guarantees, we can 
guess that the 6th degree polynomial model will have a relative RMS error around 
0.3 on new, unseen data, provided the new, unseen data is sufficiently similar to 
the test set data. 
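
The train/test comparison in this example can be sketched in a few lines of Python. This is an illustrative sketch only: the synthetic data (a noisy sine curve), the helper names `solve`, `fit_poly`, `rms_error`, and `make_data`, and the range of degrees are all assumptions, not the data or code behind figures 13.6 and 13.11. It fits polynomials of increasing degree on a training set by solving the normal equations (adequate for the low degrees used here) and records the RMS error on both sets.

```python
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_poly(xs, ys, degree):
    """Least squares polynomial fit via the normal equations (coefficients, lowest power first)."""
    p = degree + 1
    gram = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    return solve(gram, rhs)

def rms_error(theta, xs, ys):
    """RMS prediction error of the polynomial with coefficients theta on the data (xs, ys)."""
    preds = [sum(t * x ** i for i, t in enumerate(theta)) for x in xs]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))

def make_data(n, rng):
    """Synthetic stand-in data: a sine curve plus Gaussian noise (an assumption)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + 0.2 * rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(100, rng)

results = []  # rows of (degree, RMS training error, RMS test error)
for degree in range(8):
    theta = fit_poly(x_train, y_train, degree)
    results.append((degree,
                    rms_error(theta, x_train, y_train),
                    rms_error(theta, x_test, y_test)))

# Candidate degree: the one with the smallest RMS error on the test set.
best_degree = min(results, key=lambda row: row[2])[0]
for row in results:
    print("degree %d  train %.3f  test %.3f" % row)
```

The training error in `results` cannot increase with the degree, while the test error need not keep decreasing; a nearby lower degree with almost the same test error may be preferred for simplicity.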


Cross-validation. Cross-validation is an extension of out-of-sample validation that 
can be used to get even more confidence in the generalization ability of a model, 
or more accurately, a choice of basis functions used to construct a model. We 
divide the original data set into 10 sets, called folds. We then fit the model using 
folds 1,2,...,9 as training data, and fold 10 as test data. (So far, this is the same 
as out-of-sample validation.) We then fit the model using folds 1,2,...,8,10 as 
training data and fold 9 as the test data. We continue, fitting a model for each 
choice of one of the folds as the test set. We end up with 10 (presumably different) 
models, and 10 assessments of these models using the fold that was not used to fit 
the model. (We have described 10-fold cross-validation here; 5-fold cross-validation 
is also commonly used.) If the test fit performance of these 10 models is similar, 
we can expect the same, or at least similar, performance on new unseen data. In 
cross-validation we can also check for stability of the model coefficients. This means 
that the model coefficients found in the different folds are similar to each other. 
Stability of the model coefficients further enhances our confidence in the model. 
To obtain a single number that is our guess of the prediction RMS error we can 
expect on new, unseen data, it is common practice to compute the RMS test set 
error across all 10 folds. For example, if $\epsilon_1, \ldots, \epsilon_{10}$ are the RMS prediction errors 


        Model parameters                    RMS error 
Fold    $v$      $\beta_1$   $\beta_2$      Train    Test 


1       60.65    143.36      -18.00         74.00    78.44 
2       54.00    151.11      -20.30         75.11    73.89 
3       49.06    157.75      -21.10         76.22    69.93 
4       47.96    142.65      -14.35         71.16    88.35 
5       60.24    150.13      -21.11         77.28    64.20 


Table 13.1 Five-fold cross-validation for the simple regression model of the 
house sales data set. The RMS cross-validation error is 75.41. 


obtained by our models on the test folds, we take 


$$\sqrt{(\epsilon_1^2 + \cdots + \epsilon_{10}^2)/10} \qquad (13.4)$$


as our guess of the RMS error our models might make on new data. In a plot 
like that in figure 13.11, the RMS test error over all folds is plotted, instead of the 
RMS test error on a single test or validation set, as in that plot. The single 
number (13.4) is called the RMS cross-validation error, or simply the RMS test 
error (when cross-validation is used). 
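
As a small worked check of (13.4), the sketch below combines per-fold RMS test errors into a single RMS cross-validation error. The fold errors are copied from the Test column of table 13.1, which uses five folds, so the divisor is 5 rather than 10; the function name `rms_cross_validation_error` is a hypothetical helper introduced for illustration.

```python
import math

# RMS test errors for the five folds, copied from the "Test" column of table 13.1.
fold_test_rms = [78.44, 73.89, 69.93, 88.35, 64.20]

def rms_cross_validation_error(fold_errors):
    """Combine per-fold RMS test errors as in (13.4): the RMS of the fold errors."""
    return math.sqrt(sum(e ** 2 for e in fold_errors) / len(fold_errors))

cv_error = rms_cross_validation_error(fold_test_rms)
print(round(cv_error, 2))  # 75.41, the value quoted in the caption of table 13.1
```

Because the squares are averaged before the square root is taken, folds with large errors weigh more heavily than they would in a plain average of the fold errors.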

Note that cross-validation does not check a particular model, since it creates 
10 different (but hopefully not very different) models. Cross-validation checks a 
selection of basis functions. Once cross-validation is used to verify that a choice 
of basis functions produces models that predict and generalize well, there is the 
question of which of the 10 models one should use. The models should not be too 
different, so the choice really should not matter much. One reasonable choice is to 
use the parameters obtained by fitting a model over all the data; another option is 
to use the average of the model parameters from the different folds. 


House price regression model. As an example, we apply cross-validation to assess 
the generalization ability of the simple regression model of the house sales data 
discussed in §2.3 and on page 258. The simple regression model described there, 
based on house area and number of bedrooms, has an RMS fitting error of 74.8 
thousand dollars. Cross-validation will help us answer the question of how the 
model might do on different, unseen houses. 

We randomly partition the data set of 774 sales records into five folds, four of 
size 155 and one of size 154. Then we fit five regression models, each of the form 


$$\hat y = v + \beta_1 x_1 + \beta_2 x_2$$


to the data set after removing one of the folds. Table 13.1 summarizes the results. 
The model parameters for the 5 different regression models are not exactly the 
same, but quite similar. The training and test RMS errors are reasonably similar, 
which suggests that our model does not suffer from over-fit. Scanning the RMS 
error on the test sets, we can expect that our prediction error on new houses will 

Fold    $v$       RMS error (train)    RMS error (test) 


1       230.11    110.93               119.91 
2       230.25    113.49               109.96 
3       228.04    114.47               105.79 
4       225.23    110.35               122.27 
5       230.23    114.51               105.59 


Table 13.2 Five-fold cross-validation for the constant model of the house 
sales data set. The RMS cross-validation error is 119.93. 


be around 70-80 (thousand dollars) RMS. The RMS cross-validation error (13.4) is 
75.41. We can also see that the model parameters change a bit, but not drastically, 
in each of the folds. This gives us more confidence that, for example, $\beta_2$ being 
negative is not a fluke of the data. 

For comparison, table 13.2 shows the RMS errors for the constant model $\hat y = v$, 
where v is the mean price of the training set. The results suggest that the constant 
model can predict house prices with a prediction error around 105-120 (thousand 
dollars). The RMS cross-validation error for the constant model is 119.93. 
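
The procedure behind a table like table 13.2 can be sketched as follows. The house sales data are not reproduced here, so the prices below are a synthetic stand-in, and the names `make_folds`, `rms`, and `cross_validate_constant` are hypothetical helpers; only the structure (random partition into folds, constant model equal to the training mean, RMS error per fold, then the RMS cross-validation error) follows the text.

```python
import math
import random

def make_folds(n, k, rng):
    """Randomly partition the indices 0..n-1 into k folds of nearly equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def rms(values):
    """Root-mean-square value of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def cross_validate_constant(y, k, rng):
    """k-fold cross-validation of the constant model y_hat = v, with v the training mean."""
    folds = make_folds(len(y), k, rng)
    rows = []  # one (v, RMS training error, RMS test error) tuple per fold
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        v = sum(train) / len(train)
        train_rms = rms([yi - v for yi in train])
        test_rms = rms([y[i] - v for i in test_idx])
        rows.append((v, train_rms, test_rms))
    cv_error = math.sqrt(sum(row[2] ** 2 for row in rows) / k)
    return folds, rows, cv_error

rng = random.Random(1)
# Synthetic stand-in for 774 sale prices (an assumption, not the real data set).
prices = [rng.lognormvariate(5.3, 0.5) for _ in range(774)]
folds, rows, cv_error = cross_validate_constant(prices, 5, rng)
fold_sizes = sorted(len(f) for f in folds)
print(fold_sizes)  # [154, 155, 155, 155, 155]
```

With 774 records and five folds, the strided split gives four folds of size 155 and one of size 154, matching the partition described for the house sales data.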

Figure 13.12 shows the scatter plots of actual and regression model predicted 
prices for each of the five training and test sets. The results for training and 
test sets are reasonably similar in each case, which gives us confidence that the 
regression model will have similar performance on new, unseen houses. 


Validating time series predictions. When the original data are unordered, for 
example, patient records or customer purchase histories, the division of the data 
into training and test sets is typically done randomly. This same method can 
be used to validate a time series prediction model, such as an AR model, but it 
does not give the best emulation of how the model will ultimately be used. In 
practice, the model will be trained on past data and then used to make predictions 
on future data. When the training data in a time series prediction model are 
randomly chosen, the model is being built with some knowledge of the future, a 
phenomenon called look-ahead or peek-ahead. Look-ahead can make a model look 
better than it really is at making predictions. 

To avoid look-ahead, the training set for a time series prediction model is typically 
taken to be the data examples up to some point in time, and the test data 
are chosen as points that are past that time (and sometimes, at least M samples 
past that time, taking into account the memory of the predictor). In this way we 
can say that the model is being tested by making predictions on data it has never 
seen. As an example, we might train an AR model for some daily quantity using 
data from the years 2006 through 2008, and then test the resulting AR model on 
the data from year 2009. 
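
A minimal sketch of such a split is shown below. The function name `time_series_split` and its `gap` parameter are assumptions introduced for illustration; `gap` plays the role of the M samples that may be skipped to account for the memory of the predictor.

```python
def time_series_split(n_samples, n_train, gap=0):
    """Split sample indices 0..n_samples-1 into a training set (the first n_train
    samples) and a test set that starts `gap` samples after the training set ends."""
    if n_train + gap >= n_samples:
        raise ValueError("no samples left for the test set")
    train = list(range(n_train))
    test = list(range(n_train + gap, n_samples))
    return train, test

# Hypothetical example: 31 days of hourly samples, the first 24 days for training,
# and a gap of 8 samples before the test set begins.
train_idx, test_idx = time_series_split(n_samples=31 * 24, n_train=24 * 24, gap=8)
print(len(train_idx), test_idx[0], len(test_idx))  # 576 584 160
```

Every test index is later than every training index, so the model is never fitted using data from the period on which it is evaluated.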

As an example, we return to the AR model of hourly temperatures at Los 
Angeles International Airport described on page 259. We divide the one month of 
data into a training set (May 1-24) and a test set (May 25-31). The coefficients in 

Figure 13.12 Scatter plots of actual and predicted prices for the five simple 
regression models of table 13.1. The horizontal axis is the actual selling price 
and the vertical axis is the predicted price, both in thousands of dollars. Blue 
circles are samples in the training set, red circles samples in the test set. 



Figure 13.13 Hourly temperature at Los Angeles International Airport between 
12:53AM on May 25, 2016, and 11:53PM on May 29, 2016, shown 
as circles. The solid line is the prediction of an auto-regressive model with 
eight coefficients, developed using training data from May 1 to May 24. 


an AR model are computed using the $(24)(24) - 8 = 568$ samples in the training set. 
The RMS error on the training set is 1.03°F. The RMS prediction error on the test 
set is 0.98°F, which is similar to the RMS prediction error on the training set, giving 
us confidence that the AR model is not over-fit. (The fact that the test RMS error is 
very slightly smaller than the training RMS error has no significance.) Figure 13.13 
shows the prediction on the first five days of the test set. The predictions look very 
similar to those shown in figure 13.9. 
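
The count of 568 training samples comes from the fact that an AR model with memory 8 can only form a prediction once 8 earlier values are available. The sketch below illustrates this with a placeholder series; the function name `ar_examples` and the ordering of the lagged values (most recent first) are assumptions made for the illustration.

```python
def ar_examples(z, M):
    """Build (features, target) pairs for an AR model with memory M.
    The features for predicting z[t] are the M preceding values, most recent first."""
    return [(z[t - M:t][::-1], z[t]) for t in range(M, len(z))]

# Placeholder series standing in for 24 days of hourly temperatures.
hours = list(range(24 * 24))
examples = ar_examples(hours, M=8)
print(len(examples))  # 568
print(examples[0])    # ([7, 6, 5, 4, 3, 2, 1, 0], 8)
```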


Limitations of out-of-sample and cross-validation. Here we mention a few limitations 
of out-of-sample and cross-validation. First, the basic assumption that the 
test data and future data are similar can (and does) fail in some applications. For 
example, a model that predicts consumer demand, trained and validated on this 
year’s data, can make much poorer predictions next year, simply because consumer 
tastes shift. In finance, patterns of asset returns periodically shift, so models that 
predict well on test data from this year need not predict well next year. 

Another limitation arises when the data set is small, which makes it harder to 
interpret out-of-sample and cross-validation results. In this case the out-of-sample 
test RMS error might be small due to good luck, or large due to bad luck, in the 
selection of the test set. In cross-validation the test results can vary considerably, 
due to luck of which data points fall into the different folds. Here too concepts 
from statistics can make this idea more precise, but we leave it as an informal idea: 
With small data sets, we can expect to see more variation in test RMS prediction 
error than with larger data sets. 

Despite these limitations, out-of-sample and cross-validation are powerful and 
useful tools for assessing the generalization ability of a model. 


Feature engineering 


In this section we discuss some methods used to find appropriate basis functions or 
feature mappings $f_1, \ldots, f_p$. We observed above (in §13.1.2) that fitting a 
linear-in-the-parameters model reduces to regression with new features which are the original 
features $x$ mapped through the basis (or feature mapping) functions $f_1, \ldots, f_p$. 
Choosing the feature mapping functions is sometimes called feature engineering, 
since we are generating features to use in regression. 

For a given data set we may consider several, or even many, candidate choices 
of the basis functions. To choose among these candidate choices of basis functions, 
we use out-of-sample validation or cross-validation. 


Adding new features to get a richer model. In many cases the basis functions 
include the constant one, i.e., we have $f_1(x) = 1$. (This is equivalent to having the 
offset in the basic regression model.) It is also very common to include the original 
features as well, as in $f_i(x) = x_{i-1}$, $i = 2, \ldots, n+1$. If we do this, we are effectively 
starting with the basic regression model; we can then add new features to get a 
richer model. In this case we have p > n, so there are more mapped features than 
original features. (Whether or not it is a good idea to add the new features can be 
determined by out-of-sample validation or cross-validation.) 


Dimension reduction. In some cases, and especially when the number n of the 
original features is very large, the feature mappings are used to construct a smaller 
set of p < n features. In this case we can think of the feature mappings or basis 
functions as a dimension reduction or data aggregation procedure. 


Transforming features 


Standardizing features. Instead of using the original features directly, it is common 
to apply a scaling and offset to each original feature, say, 


$$f_i(x) = (x_{i-1} - b_i)/a_i, \quad i = 2, \ldots, n+1,$$


so that across the data set, the average value of $f_i(x)$ is near zero, and the standard 
deviation is around one. (This is done by choosing $b_i$ to be near the mean of 
the values of original feature $x_{i-1}$ over the data set, and choosing $a_i$ to be near the standard 
deviation of those values.) This is called standardizing or z-scoring the features. 
The standardized feature values are easily interpretable since they correspond to 
z-values; for example, $f_i(x) = +3.3$ means that the value of original feature $x_{i-1}$ is 
quite a bit above the typical value. The standardization of each original feature is 
typically the first step in feature engineering. 

Note that the constant feature $f_1(x) = 1$ is not standardized. (In fact, it cannot 
be standardized since its standard deviation across the data set is zero.) 
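
A minimal sketch of standardizing one feature column is given below. The function name `standardize` and the sample values are assumptions; the offset and scale are taken to be exactly the mean and standard deviation of the column, which is one way of choosing them "near" those values.

```python
import math

def standardize(column):
    """Return the z-scored column together with the offset b (mean) and scale a (std)."""
    b = sum(column) / len(column)
    a = math.sqrt(sum((v - b) ** 2 for v in column) / len(column))
    if a == 0:
        # A constant feature has zero standard deviation and cannot be standardized.
        raise ValueError("a constant feature cannot be standardized")
    return [(v - b) / a for v in column], b, a

# Hypothetical feature values, e.g. house areas in 1000 square feet.
areas = [0.9, 1.2, 1.5, 2.1, 3.3]
z, b, a = standardize(areas)
print([round(v, 2) for v in z])
```

When the model is later applied to new data, the same `b` and `a` found on the training data would normally be reused, rather than recomputed from the new data.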


Winsorizing features. When the data include some very large values that are 
thought to be errors (say, in collecting the data), it is common to clip or winsorize 
the data. This means that we set any data values with absolute value larger 
than some chosen threshold to their sign times the chosen threshold. Assuming, 
for example, that a feature entry $x_5$ has already been standardized (so it represents 
z-scores across the examples), we replace $x_5$ with its winsorized value (with 
threshold 3), 


$$\tilde x_5 = \begin{cases} x_5 & |x_5| \leq 3 \\ 3 & x_5 > 3 \\ -3 & x_5 < -3. \end{cases}$$


The term winsorize is named after the statistician Charles P. Winsor. 
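
Winsorizing a standardized value amounts to clipping it to the interval between minus the threshold and the threshold. The sketch below shows this for a single value; the function name `winsorize` is a hypothetical helper.

```python
def winsorize(value, threshold=3.0):
    """Clip value to the interval [-threshold, threshold]."""
    return max(-threshold, min(threshold, value))

z_scores = [0.4, -1.7, 4.2, -7.0, 3.0]
print([winsorize(z) for z in z_scores])  # [0.4, -1.7, 3.0, -3.0, 3.0]
```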


Log transform. When feature values are positive and vary over a wide range, 
it is common to replace them with their logarithms. If the feature values also 
include the value 0 (so the logarithm is undefined), a common variation on the log 
transformation is to use $\tilde x_k = \log(x_k + 1)$. This compresses the range of values that 
we encounter. As an example, suppose the original features record the number 
of visits to websites over some time period. These can easily vary over a range of 
10000:1 (or even more) for a very popular website and a less popular one; taking the 
logarithm of the visit counts gives a feature with less variation, which is possibly 
more interpretable. (Whether to use the original feature values 
or their logarithms can be decided by validation.) 
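
The effect of the shifted log transform can be seen on a few hypothetical visit counts. The counts below are made up for illustration; `math.log1p(v)` computes the natural logarithm of `1 + v`.

```python
import math

# Hypothetical visit counts, including a site with no visits at all.
visits = [0, 9, 99, 9999, 999999]

# The shifted transform log(x + 1) is defined at 0 and compresses the range.
log_visits = [math.log1p(v) for v in visits]
print([round(v, 2) for v in log_visits])
```

The raw counts span six orders of magnitude, while the transformed values all lie between 0 and about 14.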


Creating new features 


Expanding categoricals. Some features take on only a few values, such as $-1$ and 
1 or 0 and 1, which might represent some value like presence or absence of some 
symptom. (Such features are called Boolean.) A Likert scale response (see page 71) 
naturally only takes on a small number of values, such as $-2$, $-1$, 0, 1, 2. Another 
example is an original feature that takes on the values 1, 2, ..., 7, representing the 
day of the week. Such features are called categorical in statistics, since they specify 
which category the example is in, and not some real number. 

Expanding a categorical feature with $l$ values means replacing it with a set of 
$l - 1$ new features, each of which is Boolean, and simply records whether or not 
the original feature has the associated value. (When all these features are zero, 
it means the original feature had the default value.) As an example, suppose the 
original feature $x_1$ takes on only the values $-1$, 0, and 1. Using the feature value 
0 as the default feature value, we replace $x_1$ with the two mapped features 


$$f_1(x) = \begin{cases} 1 & x_1 = -1 \\ 0 & \text{otherwise,} \end{cases} \qquad f_2(x) = \begin{cases} 1 & x_1 = 1 \\ 0 & \text{otherwise.} \end{cases}$$


In words, $f_1(x)$ tells us if $x_1$ has the value $-1$, and $f_2(x)$ tells us if $x_1$ has the value 1. 
(We do not need a new feature for the default value $x_1 = 0$; this corresponds to 
$f_1(x) = f_2(x) = 0$.) This feature mapping is shown in table 13.3. 

Expanding a categorical feature with $l$ values into $l - 1$ features that encode 
whether the feature has one of the (non-default) values is sometimes called one-hot 
encoding, because for any data example, at most one of the new feature values is one, 
and the others are zero. (When the original feature has the default value, all the 
new features are zero.) 


$x_1$    $f_1(x)$    $f_2(x)$ 

-1       1           0 
 0       0           0 
 1       0           1 


Table 13.3 The original categorical feature $x_1$ takes on only the three values 
listed in the first column. This feature is replaced with (expanded into) the 
two features $f_1(x)$ and $f_2(x)$ shown in the second and third columns. 


There is no need to expand an original feature that is Boolean (i.e., takes on 
two values). If the original Boolean feature is encoded with the values 0 and 1, and 
0 is taken as the default value, then the one new feature value will be the same as 
the original feature value. 

As an example of expanding categoricals, consider a model that is used to 
predict house prices based on various features that include the number of bedrooms, 
that ranges from 1 to 5 (say). In the basic regression model, we use the number 
of bedrooms directly as a feature. In the basic model there is one parameter value 
that corresponds to value per bedroom; we multiply this parameter by the number 
of bedrooms to get the contribution to our price prediction. In this model, the price 
prediction increases (or decreases) by the same amount when we change the number 
of bedrooms from 1 to 2 as it does when we change the number of bedrooms from 
4 to 5. If we expand this categorical feature, using 2 bedrooms as the default, we 
have 4 Boolean features that correspond to a house having 1, 3, 4, and 5 bedrooms. 
We then have 4 parameters in our model, which assign different amounts to add 
to our prediction for houses with 1, 3, 4, and 5 bedrooms, respectively. This more 
flexible model can capture the idea that a change from 1 to 2 bedrooms is different 
from a change from 4 to 5 bedrooms. 
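
The expansion of a categorical feature can be sketched as below. The function name `expand_categorical` and its argument layout are assumptions made for illustration; the two uses mirror the three-valued feature with default 0 and the bedroom example with default 2.

```python
def expand_categorical(value, values, default):
    """Expand a categorical value into Boolean features, one per non-default value.
    All features are 0 when the value equals the default."""
    if value not in values:
        raise ValueError("unexpected categorical value: %r" % (value,))
    return [1 if value == v else 0 for v in values if v != default]

# Three-valued feature with default value 0: two new features.
print(expand_categorical(-1, [-1, 0, 1], default=0))  # [1, 0]
print(expand_categorical(0, [-1, 0, 1], default=0))   # [0, 0]

# Number of bedrooms (1 to 5) with 2 as the default: four new features,
# for houses with 1, 3, 4, and 5 bedrooms.
print(expand_categorical(4, [1, 2, 3, 4, 5], default=2))  # [0, 0, 1, 0]
```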


Generalized additive model. We introduce new features that are nonlinear functions 
of the original features, such as, for each $x_i$, the functions $\min\{x_i + a, 0\}$ 
and $\max\{x_i - b, 0\}$, where $a$ and $b$ are parameters. These new features are readily 
interpreted: $\min\{x_i + a, 0\}$ is the (nonpositive) amount by which feature $x_i$ falls below $-a$, and 
$\max\{x_i - b, 0\}$ is the amount by which feature $x_i$ is above $b$. A common choice, 
assuming that $x_i$ has already been standardized, is $a = b = 1$. This leads to the 
predictor 

$$\hat y = \psi_1(x_1) + \cdots + \psi_n(x_n), \qquad (13.5)$$


where $\psi_i$ is the piecewise-linear function 


$$\psi_i(x_i) = \theta_{n+i} \min\{x_i + a, 0\} + \theta_i x_i + \theta_{2n+i} \max\{x_i - b, 0\}, \qquad (13.6)$$


which has kink or knot points at the values —a and +b. The model (13.5) has 3n 
parameters, corresponding to the original features, and the two additional features 
per original feature. The prediction $\hat y$ is a sum of functions of the original features, 
and is called a generalized additive model. (More complex versions add more than 
two additional functions of each original feature.) 


[Figure panels: $\psi_1(x_1)$ and $\psi_2(x_2)$] 


Figure 13.14 The functions $\psi_i$ in (13.6) for $n = 2$, $a = b = 1$, and $\theta_1 = 0.5$, 
$\theta_2 = -0.4$, $\theta_3 = 0.3$, $\theta_4 = -0.2$, $\theta_5 = -0.3$, $\theta_6 = 0.2$. 


Consider an example with $n = 2$ original features. Our prediction $\hat y$ is a sum 
of two piecewise-linear functions, each depending on one of the original features. 
Figure 13.14 shows an example. In this example model, we can say that increasing 
$x_1$ increases our prediction $\hat y$; but for high values of $x_1$ (i.e., above 1) the increase 
in the prediction is less pronounced, and for low values (i.e., below $-1$), it is more 
pronounced. 
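
The piecewise-linear functions in (13.6) are easy to evaluate directly. The sketch below uses the parameter values listed in the caption of figure 13.14; the function names `psi` and `predict` and the zero-based list layout of `theta` are assumptions made for the illustration.

```python
def psi(x_i, theta_lin, theta_low, theta_high, a=1.0, b=1.0):
    """Piecewise-linear function (13.6) of one feature, with knots at -a and +b."""
    return (theta_low * min(x_i + a, 0.0)
            + theta_lin * x_i
            + theta_high * max(x_i - b, 0.0))

# theta_1, ..., theta_6 from the caption of figure 13.14 (stored zero-based).
theta = [0.5, -0.4, 0.3, -0.2, -0.3, 0.2]
n = 2

def predict(x):
    """Generalized additive prediction (13.5): the sum of psi_i(x_i) over the features."""
    return sum(psi(x[i], theta[i], theta[n + i], theta[2 * n + i]) for i in range(n))

# Slopes of psi_1: 0.8 below -1, 0.5 between the knots, 0.2 above +1.
print(round(psi(-2.0, theta[0], theta[2], theta[4]), 3))  # -1.3
print(round(psi(2.0, theta[0], theta[2], theta[4]), 3))   # 0.7
print(round(predict([2.0, -2.0]), 3))                     # 1.7
```

The three slopes of the first function confirm the description in the text: the prediction increases with the first feature everywhere, more steeply below the lower knot and less steeply above the upper knot.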


Products and interactions. New features can be developed from pairs of original 
features, for example, their product. From the original features we can add $x_i x_j$, 
for $i, j = 1, \ldots, n$, $i < j$. Products are used to model interactions among the 
features. Product features are easily interpretable when the original features are 
Boolean, i.e., take the values 0 or 1. Thus $x_i = 1$ means that feature $i$ is present 
or has occurred, and the new product feature $x_i x_j$ has the value 1 exactly when 
both features $i$ and $j$ have occurred. 
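
A short sketch of generating all pairwise product features is shown below; the function name `product_features` is a hypothetical helper, and the Boolean example vector is made up.

```python
from itertools import combinations

def product_features(x):
    """Return the products x_i * x_j for all index pairs with i < j."""
    return [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]

# Three Boolean features: features 1 and 3 are present, feature 2 is not.
x = [1, 0, 1]
print(product_features(x))  # [0, 1, 0], for the pairs (1,2), (1,3), (2,3)
```

For n original features this adds n(n-1)/2 new features, so the number of parameters grows quickly with n.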


Stratified models. In a stratified model, we have several different sub-models, 
and choose the one to use depending on the values of the regressors. For example, 
instead of treating gender as a regressor in a single model of some medical outcome, 
we build two different sub-models, one for male patients and one for female patients. 
In this case we choose the sub-model to use based on one of the original features, 
gender. 


As a more general example, we can carry out clustering of the original feature 
vectors, and fit a separate model within each cluster. To evaluate $\hat y$ for a new $x$, we 
first determine which cluster x is in, and then use the associated model. Whether 
or not a stratified model is a good idea is checked using out-of-sample validation. 


Advanced feature generation methods 


Custom mappings. In many applications custom mappings of the raw data are 
used as features, in addition to the original features given. For example, 
in a model meant to predict an asset’s future price using prior prices, we might also 
use the highest and lowest prices over the last week. Another well-known example 
in financial models is the price-to-earnings ratio, constructed from the price and 
(last) earnings features. 

In document analysis applications word count features are typically replaced 
with term frequency inverse document frequency (TFIDF) values, which scale the 
raw count values by a function of the frequency with which the word appears across 
the given set of documents, usually in such a way that uncommon words are given 
more weight. (There are many variations on the particular scaling function to use. 
Which one to use in a given application can be determined by out-of-sample or 
cross-validation.) 


Predictions from other models. In many applications there are existing models 
for the data. A common trick is to use the predictions of these models as features 
in your model. In this case you can describe your model as one that combines 
or blends the raw data available with predictions made from one or more existing 
models to create a new prediction. 


Distance to cluster representatives. We can build new features from a clustering 
of the data into $k$ groups. One simple method uses the cluster representatives 
$z_1, \ldots, z_k$, and gives $k$ new features, given by $f_i(x) = e^{-\|x - z_i\|^2/\sigma^2}$, where $\sigma$ is a 
parameter. 
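
These features can be computed directly from the formula. In the sketch below the cluster representatives and the value of the parameter are made up for illustration, and `cluster_features` is a hypothetical helper name.

```python
import math

def cluster_features(x, representatives, sigma):
    """One feature per cluster representative z: exp(-||x - z||^2 / sigma^2)."""
    return [
        math.exp(-sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / sigma ** 2)
        for z in representatives
    ]

# Two hypothetical cluster representatives in the plane.
reps = [[0.0, 0.0], [3.0, 4.0]]
features = cluster_features([0.0, 0.0], reps, sigma=5.0)
print([round(f, 4) for f in features])  # [1.0, 0.3679]
```

Each feature equals 1 when the example coincides with the representative and decays toward 0 as the distance grows, at a rate set by the parameter.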


Random features. The new features are given by a nonlinear function of a random 
linear combination of the original features. To add $K$ new features of this type, 
we first generate a random $K \times n$ matrix $R$. We then generate new features as 
$(Rx)_+$ or $|Rx|$, where $(\cdot)_+$ and $|\cdot|$ are applied elementwise to the vector $Rx$. (Other 
nonlinear functions can be used as well.) 

This approach to generating new features is quite counter-intuitive, since you 
would imagine that feature engineering should be done using detailed knowledge 
of, and intuition about, the particular application. Nevertheless this method can 
be very effective in some applications. 
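
A minimal sketch of random features follows. The text does not say how the random matrix is generated, so drawing its entries from a standard normal distribution is an assumption, as are the helper names `matvec`, `positive_part_features`, and `absolute_value_features`.

```python
import random

def matvec(R, x):
    """Matrix-vector product R x for a matrix stored as a list of rows."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in R]

def positive_part_features(R, x):
    """New features (Rx)_+ : the elementwise positive part of R x."""
    return [max(v, 0.0) for v in matvec(R, x)]

def absolute_value_features(R, x):
    """New features |Rx| : the elementwise absolute value of R x."""
    return [abs(v) for v in matvec(R, x)]

rng = random.Random(0)
K, n = 4, 3
# Assumption: entries of R are independent standard normal samples.
R = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(K)]

x = [1.0, -2.0, 0.5]
pos = positive_part_features(R, x)
mag = absolute_value_features(R, x)
```

The matrix is generated once and then held fixed, so that the same mapping is applied to every training example and to every new example at prediction time.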


Neural network features. A neural network computes transformed features using 
compositions of linear transformations interspersed with nonlinear mappings such 
as the absolute value. This architecture was originally inspired by biology, as a 
crude model of how human and animal brains work. The ideas behind neural 
networks are very old, but their use has accelerated over the last few years due to 
a combination of new techniques, greatly increased computing power, and access 
to large amounts of data. Neural networks can find good feature mappings directly 
from the data, provided there is a very large amount of data available. 


Summary 


The discussion above makes it clear that there is much art in choosing features to 
use in a model. But it is important to keep several things in mind when creating 
new features: 


• Try simple models first. Start with a constant, then a simple regression model, 
and so on. You can compare more sophisticated models against these. 


• Compare competing candidate models using validation. Adding new features 
will always reduce (or at least not increase) the RMS error on the training data, but the important 
question is whether or not it substantially reduces the RMS error on the test 
or validation data sets. (We add the qualifier ‘substantially’ here because a 
small reduction in test set error is not meaningful.) 


• Adding new features can easily lead to over-fit. (This will show up when 
validating the model.) The most straightforward way to avoid over-fit is to 
keep the model simple. We mention here that another approach to avoiding 
over-fit, called regularization (covered in chapter 15), can be very effective 
when combined with feature engineering. 


House price prediction 


In this section we use feature engineering to develop a more complicated regression 
model for the house sales data, illustrating some of the methods described 
above. As mentioned in §2.3, the data set contains records of 774 house sales in 
the Sacramento area. For our more complex model we will use four base attributes 
or original features: 


• $x_1$ is the area of the house (in 1000 square feet), 
• $x_2$ is the number of bedrooms, 
• $x_3$ is equal to one if the property is a condominium, and zero otherwise, 
• $x_4$ is the five-digit ZIP code. 

Only the first two attributes were used in the simple regression model 

$$\hat y = \beta_1 x_1 + \beta_2 x_2 + v$$

given in §2.3. In that model, we do not carry out any feature engineering or 
modification. 


Feature engineering. Here we examine a more complicated model, with 8 basis 
functions, 


$$\hat y = \sum_{i=1}^{8} \theta_i f_i(x).$$


These basis functions are described below. 


$x_4$ (five-digit ZIP code)                    $f_6(x)$   $f_7(x)$   $f_8(x)$ 

95811, 95814, 95816, 95817, 95818, 95819       0          0          0 

95608, 95610, 95621, 95626, 95628, 95655,      1          0          0 
95660, 95662, 95670, 95673, 95683, 95691, 
95742, 95815, 95821, 95825, 95827, 95833, 
95834, 95835, 95838, 95841, 95842, 95843, 
95864 

95624, 95632, 95690, 95693, 95757, 95758,      0          1          0 
95820, 95822, 95823, 95824, 95826, 95828, 
95829, 95831, 95832 

95603, 95614, 95630, 95635, 95648, 95650,      0          0          1 
95661, 95663, 95677, 95678, 95682, 95722, 
95746, 95747, 95762, 95765 


Table 13.4 Definition of basis functions $f_6$, $f_7$, $f_8$ as functions of $x_4$ (5-digit 
ZIP code). 


The first basis function is the constant $f_1(x) = 1$. The next two are functions 
of $x_1$, the area of the house, 


$$f_2(x) = x_1, \qquad f_3(x) = \max\{x_1 - 1.5, 0\}.$$


In words, $f_2(x)$ is the area of the house, and $f_3(x)$ is the amount by which the area 
exceeds 1.5 (i.e., 1500 square feet). The first three basis functions contribute to 
the price prediction model a piecewise-linear function of the house area, 


$$\theta_1 f_1(x) + \theta_2 f_2(x) + \theta_3 f_3(x) = \begin{cases} \theta_1 + \theta_2 x_1 & x_1 \leq 1.5 \\ \theta_1 - 1.5\theta_3 + (\theta_2 + \theta_3) x_1 & x_1 > 1.5, \end{cases}$$


with one knot at 1.5. This is an example of a generalized additive model described 
on page 271. 

The basis function $f_4(x)$ is equal to the number of bedrooms $x_2$. The basis 
function $f_5(x)$ is equal to $x_3$, i.e., one if the property is a condominium, and 
zero otherwise. In these cases we simply use the original feature value, with no 
transformation or modification. 

The last three basis functions are again Boolean, and indicate or encode the 
location of the house. We partition the 62 different ZIP codes present in the data set 
into four groups, corresponding to different areas around the center of Sacramento, 
as shown in table 13.4. The basis functions $f_6$, $f_7$, and $f_8$ give a one-hot encoding 
of the four groups of ZIP codes, as described on page 270. 


The resulting model. The coefficients in the least squares fit are 

$$\theta_1 = 115.62, \quad \theta_2 = 175.41, \quad \theta_3 = -42.75, \quad \theta_4 = -17.88,$$

$$\theta_5 = -19.05, \quad \theta_6 = -100.91, \quad \theta_7 = -108.79, \quad \theta_8 = -24.77.$$


Figure 13.15 Scatter plot of actual and predicted prices for a model with 
eight parameters. 


The RMS fitting error is 68.3, a bit better than the simple regression fit, which 
achieves RMS fitting error of 74.8. Figure 13.15 shows a scatter plot of predicted 
and actual prices. 
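
To make the eight-parameter model concrete, the sketch below evaluates the basis functions and the prediction for a given house, using the coefficients quoted above and the ZIP code groups of table 13.4. The names `basis` and `predict_price`, the two example houses, and the choice to treat any ZIP code outside the three listed groups as belonging to the default (first) group are assumptions made for illustration.

```python
# ZIP code groups 2, 3, and 4 from table 13.4; the first group is the default.
GROUP_2 = {95608, 95610, 95621, 95626, 95628, 95655, 95660, 95662, 95670,
           95673, 95683, 95691, 95742, 95815, 95821, 95825, 95827, 95833,
           95834, 95835, 95838, 95841, 95842, 95843, 95864}
GROUP_3 = {95624, 95632, 95690, 95693, 95757, 95758, 95820, 95822, 95823,
           95824, 95826, 95828, 95829, 95831, 95832}
GROUP_4 = {95603, 95614, 95630, 95635, 95648, 95650, 95661, 95663, 95677,
           95678, 95682, 95722, 95746, 95747, 95762, 95765}

# Coefficients theta_1, ..., theta_8 of the least squares fit quoted in the text.
THETA = [115.62, 175.41, -42.75, -17.88, -19.05, -100.91, -108.79, -24.77]

def basis(area, beds, condo, zip_code):
    """Evaluate f_1(x), ..., f_8(x); area is in 1000 square feet, condo is 0 or 1."""
    return [
        1.0,                        # f_1: constant
        area,                       # f_2: house area
        max(area - 1.5, 0.0),       # f_3: area in excess of 1.5
        float(beds),                # f_4: number of bedrooms
        float(condo),               # f_5: condominium indicator
        float(zip_code in GROUP_2), # f_6, f_7, f_8: one-hot ZIP code group
        float(zip_code in GROUP_3),
        float(zip_code in GROUP_4),
    ]

def predict_price(area, beds, condo, zip_code):
    """Predicted price in thousands of dollars: the sum of theta_i * f_i(x)."""
    return sum(t * f for t, f in zip(THETA, basis(area, beds, condo, zip_code)))

# Two hypothetical houses (not records from the data set).
print(round(predict_price(1.2, 3, 0, 95811), 3))  # 272.472
print(round(predict_price(2.0, 4, 0, 95630), 3))  # 348.775
```

For the second house the area exceeds the knot at 1.5, so the third coefficient lowers the effective price per additional 1000 square feet from 175.41 to 132.66.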

To validate the model, we use 5-fold cross-validation, using the same folds as in 
table 13.1 and figure 13.12. The results are shown in table 13.5 and figure 13.16. 
The training and test set errors are similar, so our model is not over-fit. We 
also see that the test set errors are a bit better than those obtained by our simple 
regression model; the RMS cross-validation errors are 69.29 and 75.41, respectively. 
We conclude that our more complex model, that uses feature engineering, gives a 
modest (around 8%) improvement in prediction ability over the simple regression 
model based on only house area and number of bedrooms. (With more data, more 
features, and more feature engineering, a much more accurate model of house price 
can be developed.) 

The table also shows that the model coefficients are reasonably stable across 
the different folds, giving us more confidence in the model. Another interesting 
phenomenon we observe is that the RMS error for fold 5 is a bit lower on the test 
set than on the training set. This occasionally happens, as a consequence of how 
the original data were split into folds. 


Figure 13.16 Scatter plots of actual and predicted prices for the five models 
of table 13.5. The horizontal axis is the actual selling price and the vertical 
axis is the predicted price, both in thousands of dollars. Blue circles are 
samples in the training set, red circles samples in the test set. 


                                       Model parameters                                                       RMS error 
Fold  $\theta_1$  $\theta_2$  $\theta_3$  $\theta_4$  $\theta_5$  $\theta_6$  $\theta_7$  $\theta_8$    Train   Test 


1     122.35      166.87      -39.27      -16.31      -23.97      -100.42     -106.66     -25.98        67.29   72.78 
2     100.95      186.65      -55.80      -18.66      -14.81      -99.10      -109.62     -17.94        67.83   70.81 
3     133.61      167.15      -23.62      -18.66      -14.71      -109.32     -114.41     -28.46        69.70   63.80 
4     108.43      171.21      -41.25      -15.42      -17.68      -94.17      -103.63     -29.83        65.58   78.91 
5     114.45      185.69      -52.71      -20.87      -23.26      -102.84     -110.46     -23.43        70.69   58.27 


Table 13.5 Five-fold validation on the house sales data set. The RMS 
cross-validation error is 69.29. 


Exercises 


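Error in straight-line fit. Consider the straight-line fit described on page 249, with data 
given by the N-vectors $x^{\mathrm{d}}$ and $y^{\mathrm{d}}$. Let $r^{\mathrm{d}} = y^{\mathrm{d}} - \hat y^{\mathrm{d}}$ denote the residual or prediction 
error using the straight-line model (13.3). Show that $\mathbf{rms}(r^{\mathrm{d}}) = \mathbf{std}(y^{\mathrm{d}}) \sqrt{1 - \rho^2}$, where 
$\rho$ is the correlation coefficient of $x^{\mathrm{d}}$ and $y^{\mathrm{d}}$ (assumed non-constant). This shows that the 
RMS error with the straight-line fit is a factor $\sqrt{1 - \rho^2}$ smaller than the RMS error with 
a constant fit, which is $\mathbf{std}(y^{\mathrm{d}})$. It follows that when $x^{\mathrm{d}}$ and $y^{\mathrm{d}}$ are highly correlated 
($\rho \approx 1$) or anti-correlated ($\rho \approx -1$), the straight-line fit is much better than the constant 
fit. Hint. From (13.3) we have 


$$\hat y^{\mathrm{d}} - y^{\mathrm{d}} = \rho \frac{\mathbf{std}(y^{\mathrm{d}})}{\mathbf{std}(x^{\mathrm{d}})} \left(x^{\mathrm{d}} - \mathbf{avg}(x^{\mathrm{d}}) \mathbf{1}\right) - \left(y^{\mathrm{d}} - \mathbf{avg}(y^{\mathrm{d}}) \mathbf{1}\right).$$


Expand the norm squared of this expression, and use 


$$\rho = \frac{\left(x^{\mathrm{d}} - \mathbf{avg}(x^{\mathrm{d}}) \mathbf{1}\right)^T \left(y^{\mathrm{d}} - \mathbf{avg}(y^{\mathrm{d}}) \mathbf{1}\right)}{\|x^{\mathrm{d}} - \mathbf{avg}(x^{\mathrm{d}}) \mathbf{1}\| \, \|y^{\mathrm{d}} - \mathbf{avg}(y^{\mathrm{d}}) \mathbf{1}\|}.$$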


Regression to the mean. Consider a data set in which the (scalar) $x^{(i)}$ is the parent’s 
height (average of mother’s and father’s height), and $y^{(i)}$ is their child’s height. Assume 
that over the data set the parent and child heights have the same mean value $\mu$, and 
the same standard deviation $\sigma$. We will also assume that the correlation coefficient $\rho$ 
between parent and child heights is (strictly) between zero and one. (These assumptions 
hold, at least approximately, in real data sets that are large enough.) Consider the simple 
straight-line fit or regression model given by (13.3), which predicts a child’s height from 
the parent’s height. Show that this prediction of the child’s height lies (strictly) between 
the parent’s height and the mean height $\mu$ (unless the parent’s height happens to be 
exactly the mean $\mu$). For example, if the parents are tall, i.e., have height above the 
mean, we predict that their child will be shorter, but still tall. This phenomenon, called 
regression to the mean, was first observed by the early statistician Sir Francis Galton (who 
indeed, studied a data set of parents’ and children’s heights). 


Moore's law. The figure and table below show the number of transistors N in 13 microprocessors,
and the year of their introduction.

Year    Transistors
1971          2,250
1972          2,500
1974          5,000
1978         29,000
1982        120,000
1985        275,000
1989      1,180,000
1993      3,100,000
1997      7,500,000
1999     24,000,000
2000     42,000,000
2002    220,000,000
2003    410,000,000

[Figure: number of transistors versus year of introduction, 1970 to 2005, with a logarithmic vertical axis.]

The plot gives the number of transistors on a logarithmic scale. Find the least squares
straight-line fit of the data using the model

$$\log_{10} N \approx \theta_1 + \theta_2 (t - 1970),$$

where t is the year and N is the number of transistors. Note that $\theta_1$ is the model's
prediction of the log of the number of transistors in 1970, and $10^{\theta_2}$ gives the model's
prediction of the fractional increase in number of transistors per year.


(a) Find the coefficients $\theta_1$ and $\theta_2$ that minimize the RMS error on the data, and give
the RMS error on the data. Plot the model you find along with the data points. 


(b) Use your model to predict the number of transistors in a microprocessor introduced 
in 2015. Compare the prediction to the IBM Z13 microprocessor, released in 2015, 
which has around $4 \times 10^9$ transistors.


(c) Compare your result with Moore’s law, which states that the number of transistors 
per integrated circuit roughly doubles every one and a half to two years. 


The computer scientist and Intel corporation co-founder Gordon Moore formulated the 
law that bears his name in a magazine article published in 1965. 


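Asset $\alpha$ and $\beta$ and market correlation. Suppose the T-vectors $r^{\mathrm{ind}} = (r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_T)$ and
$r^{\mathrm{mkt}} = (r^{\mathrm{mkt}}_1, \ldots, r^{\mathrm{mkt}}_T)$ are return time series for a specific asset and the whole market,
as described on page 251. We let $r^{\mathrm{rf}}$ denote the risk-free interest rate, $\mu^{\mathrm{mkt}}$ and $\sigma^{\mathrm{mkt}}$ the
market return and risk (i.e., $\mathrm{avg}(r^{\mathrm{mkt}})$ and $\mathrm{std}(r^{\mathrm{mkt}})$), and $\mu$ and $\sigma$ the return and risk
of the asset (i.e., $\mathrm{avg}(r^{\mathrm{ind}})$ and $\mathrm{std}(r^{\mathrm{ind}})$). Let $\rho$ be the correlation coefficient between
the market and asset return time series $r^{\mathrm{mkt}}$ and $r^{\mathrm{ind}}$. Express the asset $\alpha$ and $\beta$ in terms
of $r^{\mathrm{rf}}$, $\mu$, $\sigma$, $\mu^{\mathrm{mkt}}$, $\sigma^{\mathrm{mkt}}$, and $\rho$.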


Polynomial model with multiple features. The idea of polynomial models can be extended 
from the case discussed on page 255 where there is only one feature. In this exercise we 
consider a quadratic (degree two) model with 3 features, i.e., x is a 3-vector. This has 
the form 


$$\hat f(x) = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + c_1 x_1^2 + c_2 x_2^2 + c_3 x_3^2 + c_4 x_1 x_2 + c_5 x_1 x_3 + c_6 x_2 x_3,$$

where the scalar a, 3-vector b, and 6-vector c are the zeroth, first, and second order
coefficients in the model. Put this model into our general linear in the parameters form,
by giving p, and the basis functions $f_1, \ldots, f_p$ (which map 3-vectors to scalars).


Average prediction error. Consider a data fitting problem, with first basis function $\phi_1(x) = 1$,
and data set $x^{(1)}, \ldots, x^{(N)}$, $y^{(1)}, \ldots, y^{(N)}$. Assume the matrix A in (13.1) has linearly
independent columns and let $\hat\theta$ denote the parameter values that minimize the mean square
prediction error over the data set. Let the N-vector $\hat r^{\mathrm d}$ denote the prediction errors using
the optimal model parameter $\hat\theta$. Show that $\mathrm{avg}(\hat r^{\mathrm d}) = 0$. In other words: With the least
squares fit, the mean of the prediction errors over the data set is zero. Hint. Use the
orthogonality principle (12.9), with $z = e_1$.


Data matrix in auto-regressive time series model. An auto-regressive model with memory
M is fit by minimizing the sum of the squares of the prediction errors on a data set
with T samples, $z_1, \ldots, z_T$, as described on page 259. Find the matrix A and vector y for
which $\|A\theta - y\|^2$ gives the sum of the squares of the prediction errors. Show that A is a
Toeplitz matrix (see page 138), i.e., entries $A_{ij}$ with the same value of $i - j$ are the same.


Fitting an input-output convolution system. Let $u_1, \ldots, u_T$ and $y_1, \ldots, y_T$ be observed
input and output time series for a system that is thought to be an input-output convolution
system, meaning

$$y_t \approx \hat y_t = \sum_{j=1}^{n} h_j u_{t-j+1}, \quad t = 1, \ldots, T,$$

where we interpret $u_t$ as zero for $t \le 0$. Here the n-vector h is the system impulse response;
see page 140. This model of the relation between the input and output time series is also
called a moving average (MA) model. Find a matrix A and vector b for which

$$\|Ah - b\|^2 = (y_1 - \hat y_1)^2 + \cdots + (y_T - \hat y_T)^2.$$

Show that A is Toeplitz. (See page 138.)


Conclusions from 5-fold cross-validation. You have developed a regression model for 
predicting a scalar outcome y from a feature vector x of dimension 20, using a collection 
of N = 600 data points. The mean of the outcome variable y across the given data is 
1.85, and its standard deviation is 0.32. After running 5-fold cross-validation we get the 
following RMS test errors (based on forming a model based on the data excluding fold i, 
and testing it on fold i). 


Fold excluded RMS test error 


1 0.13 
2 0.11 
3 0.09 
4 0.11 
5 0.14 


(a) How would you expect your model to do on new, unseen (but similar) data? Respond 
briefly and justify your response. 


(b) A co-worker observes that the regression model parameters found in the 5 different 
folds are quite close, but not the same. He says that for the production system, you 
should use the regression model parameters found when you excluded fold 3 from 
the original data, since it achieved the best RMS test error. Comment briefly. 


Augmenting features with the average. You are fitting a regression model $\hat y = x^T \beta + v$
to data, computing the model coefficients $\beta$ and v using least squares. A friend suggests
adding a new feature, which is the average of the original features. (That is, he suggests
using the new feature vector $\tilde x = (x, \mathrm{avg}(x))$.) He explains that by adding this new
feature, you might end up with a better model. (Of course, you would test the new model
using validation.) Is this a good idea?


Interpreting model fitting results. Five different models are fit using the same training 
data set, and tested on the same (separate) test set (which has the same size as the 
training set). The RMS prediction errors for each model, on the training and test sets, 
are reported below. Comment briefly on the results for each model. You might mention 
whether the model’s predictions are good or bad, whether it is likely to generalize to 
unseen data, or whether it is over-fit. You are also welcome to say that you don’t believe 
the results, or think the reported numbers are fishy. 


Model    Train RMS    Test RMS


A 1.355 1.423 
B 9.760 9.165 
C 5.033 0.889 
D 0.211 5.072 
E 0.633 0.633 


Standardizing Boolean features. (See page 269.) Suppose that the N-vector x gives the
value of a (scalar) Boolean feature across a set of N examples. (Boolean means that each
$x_i$ has the value 0 or 1. This might represent the presence or absence of a symptom, or
whether or not a day is a holiday.) How do we standardize such a feature? Express your
answer in terms of p, the fraction of $x_i$ that have the value 1. (You can assume that p > 0
and p < 1; otherwise the feature is constant.)

Interaction model with Boolean features. Consider a data fitting problem in which all n
original features are Boolean, i.e., entries of x have the value 0 or 1. These features could
be the results of Boolean tests for a patient (or absence or presence of symptoms), or
a person's response to a survey with yes/no questions. We wish to use these to predict
an outcome, the number y. Our model will include a constant feature 1, the original n
Boolean features, and all interaction terms, which have the form $x_i x_j$ where $1 \le i < j \le n$.


(a) What is p, the total number of basis functions, in this model? Explicitly list the
basis functions for the case n = 3. You can decide their order. Hint. To count the
number of pairs i, j that satisfy $1 \le i < j \le n$, use equation (5.7).

(b) Interpret (together) the following three coefficients of $\theta$: the one associated with
the original feature $x_3$; the one associated with the original feature $x_5$; and the one
associated with the new product feature $x_3 x_5$. Hint. Consider the four possible
values of the pair $x_3$, $x_5$.


Least squares timing. A computer takes around one second to fit a regression model (using
least squares) with 20 parameters using $10^6$ data points.

(a) About how long do you guess it will take the same computer to fit the same 20-parameter
model using $10^7$ data points (i.e., 10x more data points)?

(b) About how long do you guess it will take the same computer to fit a 200-parameter
model using $10^6$ data points (i.e., 10x more model parameters)?


Estimating a matrix. Suppose that the n-vector x and the m-vector y are thought to be
approximately related by a linear function, i.e., $y \approx Ax$, where A is an m x n matrix. We
do not know the matrix A, but we do have observed data,

$$x^{(1)}, \ldots, x^{(N)}, \qquad y^{(1)}, \ldots, y^{(N)}.$$

We can estimate or guess the matrix A by choosing it to minimize

$$\sum_{i=1}^{N} \| A x^{(i)} - y^{(i)} \|^2 = \| AX - Y \|^2,$$

where $X = [x^{(1)} \cdots x^{(N)}]$ and $Y = [y^{(1)} \cdots y^{(N)}]$. We denote this least squares estimate
as $\hat A$. (The notation here can be confusing, since X and Y are known, and A is to be
found; it is more conventional to have symbols near the beginning of the alphabet, like
A, denote known quantities, and symbols near the end, like X and Y, denote variables or
unknowns.)

(a) Show that $\hat A = Y X^\dagger$, assuming the rows of X are linearly independent. Hint. Use
$\| AX - Y \|^2 = \| X^T A^T - Y^T \|^2$, which turns the problem into a matrix least squares
problem; see page 233.

(b) Suggest a good way to compute $\hat A$, and give the complexity in terms of n, m, and N.


Relative fitting error and error fitting the logarithm. (See page 259.) The relative
fitting error between a positive outcome y and its positive prediction $\hat y$ is given by
$\eta = \max\{\hat y / y, \; y / \hat y\} - 1$. (This is often given as a percentage.) Let r be the residual
between their logarithms, $r = \log y - \log \hat y$. Show that $\eta = e^{|r|} - 1$.


Fitting a rational function with a polynomial. Let $x_1, \ldots, x_{11}$ be 11 points uniformly
spaced in the interval [-1, 1]. (This means $x_i = -1.0 + 0.2(i - 1)$ for i = 1, ..., 11.)
Take $y_i = (1 + x_i)/(1 + 5 x_i^2)$, for i = 1, ..., 11. Find the least squares fit of polynomials
of degree 0, 1, ..., 8 to these points. Plot the fitting polynomials, and the true function
$y = (1 + x)/(1 + 5x^2)$, over the interval [-1.1, 1.1] (say, using 100 points). Note that the
interval for the plot, [-1.1, 1.1], extends a bit outside the range of the data used to fit the
polynomials, [-1, 1]; this gives us an idea of how well the polynomial fits can extrapolate.

Generate a test data set by choosing $u_1, \ldots, u_{10}$ uniformly spaced over [-1.1, 1.1], with
$v_i = (1 + u_i)/(1 + 5 u_i^2)$. Plot the RMS error of the polynomial fits found above on this test
data set. On the same plot, show the RMS error of the polynomial fits on the training
data set. Suggest a reasonable value for the degree of the polynomial fit, based on the
RMS fits on the training and test data. Remark. There is no practical reason to fit a
rational function with a polynomial. This exercise is only meant to illustrate the ideas of
fitting with different basis functions, over-fit, and validation with a test data set.

Vector auto-regressive model. The auto-regressive time series model (see page 259) can
be extended to handle time series $z_1, z_2, \ldots$ where the $z_t$ are n-vectors. This arises in
applications such as econometrics, where the entries of the vector give different economic
quantities. The vector auto-regressive model (already mentioned on page 164) has the
same form as a scalar auto-regressive model,

$$\hat z_{t+1} = \beta_1 z_t + \cdots + \beta_M z_{t-M+1}, \quad t = M, M+1, \ldots$$

where M is the memory of the model; the difference here is that the model parameters
$\beta_1, \ldots, \beta_M$ are all n x n matrices. The total number of (scalar) parameters in the vector
auto-regressive model is $M n^2$. These parameters are chosen to minimize the sum of the
squared norms of the prediction errors over a data set $z_1, \ldots, z_T$,

$$\| z_{M+1} - \hat z_{M+1} \|^2 + \cdots + \| z_T - \hat z_T \|^2.$$

(a) Give an interpretation of $(\beta_2)_{13}$. What would it mean if this coefficient is very small?

(b) Fitting a vector auto-regressive model can be expressed as a matrix least squares
problem (see page 233), i.e., the problem of choosing the matrix X to minimize
$\| AX - B \|^2$, where A and B are matrices. Find the matrices A and B. You can
consider the case M = 2. (It will be clear how to generalize your answer to larger
values of M.)

Hints. Use the (2n) x n matrix variable $X = [\, \beta_1 \;\; \beta_2 \,]^T$, and right-hand side
((T - 2) x n matrix) $B = [\, z_3 \; \cdots \; z_T \,]^T$. Your job is to find the matrix A so
that $\| AX - B \|^2$ is the sum of the squared norms of the prediction errors.


Sum of sinusoids time series model. Suppose that $z_1, z_2, \ldots$ is a time series. A very
common approximation of the time series is as a sum of K sinusoids

$$z_t \approx \hat z_t = \sum_{k=1}^{K} a_k \cos(\omega_k t - \phi_k), \quad t = 1, 2, \ldots.$$

The kth term in this sum is called a sinusoid signal. The coefficient $a_k \ge 0$ is called
the amplitude, $\omega_k > 0$ is called the frequency, and $\phi_k$ is called the phase of the kth
sinusoid. (The phase is usually chosen to lie in the range from $-\pi$ to $\pi$.) In many
applications the frequencies are multiples of $\omega_1$, i.e., $\omega_k = k \omega_1$ for k = 2, ..., K, in which
case the approximation is called a Fourier approximation, named for the mathematician
Jean-Baptiste Joseph Fourier.

Suppose you have observed the values $z_1, \ldots, z_T$, and wish to choose the sinusoid amplitudes
$a_1, \ldots, a_K$ and phases $\phi_1, \ldots, \phi_K$ so as to minimize the RMS value of the approximation
error $(\hat z_1 - z_1, \ldots, \hat z_T - z_T)$. (We assume that the frequencies are given.) Explain
how to solve this using least squares model fitting.

Hint. A sinusoid with amplitude a, frequency $\omega$, and phase $\phi$ can be described by its
cosine and sine coefficients $\alpha$ and $\beta$, where

$$a \cos(\omega t - \phi) = \alpha \cos(\omega t) + \beta \sin(\omega t),$$

where (using the cosine of sum formula) $\alpha = a \cos \phi$, $\beta = a \sin \phi$. We can recover the
amplitude and phase from the cosine and sine coefficients as

$$a = \sqrt{\alpha^2 + \beta^2}, \qquad \phi = \arctan(\beta / \alpha).$$

Express the problem in terms of the cosine and sine coefficients.
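
The identity in the hint can be checked numerically. The following is a minimal sketch, not part of the exercise: the amplitude, frequency, and phase values are made up, and the two-argument arctangent (`np.arctan2`) is used as an assumption on our side, since the one-argument arctan(β/α) only returns values between −π/2 and π/2 and cannot by itself place the phase in the correct quadrant.

```python
import numpy as np

# Hypothetical amplitude, frequency, and phase, chosen only for illustration.
a, omega, phi = 1.7, 0.4, -2.5

# Cosine and sine coefficients from the hint.
alpha = a * np.cos(phi)
beta = a * np.sin(phi)

# Compare both sides of the identity on t = 1, ..., 50.
t = np.arange(1, 51)
lhs = a * np.cos(omega * t - phi)
rhs = alpha * np.cos(omega * t) + beta * np.sin(omega * t)
max_gap = np.max(np.abs(lhs - rhs))

# Recover amplitude and phase. arctan2 uses the signs of both
# coefficients, so the recovered phase lies in (-pi, pi].
a_rec = np.hypot(alpha, beta)
phi_rec = np.arctan2(beta, alpha)
```

The point of the change of variables is that the right-hand side is linear in α and β, whereas the left-hand side is not linear in a and φ.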

Fitting with continuous and discontinuous piecewise-linear functions. Consider a fitting
problem with n = 1, so $x^{(1)}, \ldots, x^{(N)}$ and $y^{(1)}, \ldots, y^{(N)}$ are numbers. We consider two
types of closely related models. The first is a piecewise-linear model with knot points
at -1 and 1, as described on page 256, and illustrated in figure 13.8. The second is a
stratified model (see page 272), with three independent affine models, one for $x < -1$,
one for $-1 \le x \le 1$, and one for $x > 1$. (In other words, we stratify on x taking low,
middle, or high values.) Are these two models the same? Is one more general than the
other? How many parameters does each model have? Hint. See problem title.

What can you say about the training set RMS error and test set RMS error that would
be achieved using least squares with these two models?


Efficient cross-validation. The cost of fitting a model with p basis functions and N
data points (say, using QR factorization) is $2Np^2$ flops. In this exercise we explore the
complexity of carrying out 10-fold cross-validation on the same data set. We divide
the data set into 10 folds, each with N/10 data points. The naive method is to fit 10
different models, each using 9 of the folds, using the QR factorization, which requires
$10 \cdot 2(0.9)Np^2 = 18Np^2$ flops. (To evaluate each of these models on the remaining fold
requires 2(N/10)p flops, which can be ignored compared to the cost of fitting the models.)
So the naive method of carrying out 10-fold cross-validation requires, not surprisingly,
around 10x the number of flops as fitting a single model.

The steps below outline another method to carry out 10-fold cross-validation. Give the
total flop count for each step, keeping only the dominant terms, and compare the total cost
of the method to that of the naive method. Let $A_1, \ldots, A_{10}$ denote the (N/10) x p blocks
of the data matrix associated with the folds, and let $b_1, \ldots, b_{10}$ denote the right-hand
sides in the least squares fitting problem.

(a) Form the Gram matrices $G_i = A_i^T A_i$ and the vectors $c_i = A_i^T b_i$.
(b) Form $G = G_1 + \cdots + G_{10}$ and $c = c_1 + \cdots + c_{10}$.
(c) For k = 1, ..., 10, compute $\theta_k = (G - G_k)^{-1} (c - c_k)$.
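
Steps (a) to (c) can be sketched in a few lines of NumPy. This is only an illustration on random data with made-up sizes; it says nothing about the flop counts the exercise asks for, and the comparison against a direct refit is added here just to confirm that both routes give the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 200, 5, 10            # hypothetical sizes; N must be divisible by K
A = rng.standard_normal((N, p))
b = rng.standard_normal(N)

A_blocks = np.split(A, K)       # A_1, ..., A_10, each (N/10) x p
b_blocks = np.split(b, K)       # b_1, ..., b_10

# (a) Gram matrices and right-hand-side vectors for each fold.
G_blocks = [Ai.T @ Ai for Ai in A_blocks]
c_blocks = [Ai.T @ bi for Ai, bi in zip(A_blocks, b_blocks)]

# (b) Totals over all folds.
G = sum(G_blocks)
c = sum(c_blocks)

# (c) Parameters of the model fitted with fold k left out.
thetas = [np.linalg.solve(G - Gk, c - ck)
          for Gk, ck in zip(G_blocks, c_blocks)]

def naive_fit(k):
    """Refit directly on the nine folds other than fold k."""
    A_rest = np.vstack([Ai for i, Ai in enumerate(A_blocks) if i != k])
    b_rest = np.concatenate([bi for i, bi in enumerate(b_blocks) if i != k])
    return np.linalg.lstsq(A_rest, b_rest, rcond=None)[0]

# Largest entrywise difference between the two methods over all folds.
max_diff = max(np.max(np.abs(thetas[k] - naive_fit(k))) for k in range(K))
```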


Prediction contests. Several companies have run prediction contests open to the public. 
Netflix ran the best known contest, offering a $1M prize for the first prediction of user 
movie rating that beat their existing method RMS prediction error by 10% on a test set. 
The contests generally work like this (although there are several variations on this format, 
and most are more complicated). The company posts a public data set, that includes the 
regressors or features and the outcome for a large number of examples. They also post the 
features, but not the outcomes, for a (typically smaller) test data set. The contestants, 
usually teams with obscure names, submit predictions for the outcomes in the test set. 
Usually there is a limit on how many times, or how frequently, each team can submit 
a prediction on the test set. The company computes the RMS test set prediction error 
(say) for each submission. The teams’ prediction performance is shown on a leaderboard, 
which lists the 100 or so best predictions in order. 

Discuss such contests in terms of model validation. How should a team check a set of pre- 
dictions before submitting it? What would happen if there were no limits on the number 
of predictions each team can submit? Suggest an obvious method (typically disallowed 
by the contest rules) for a team to get around the limit on prediction submissions. (And 
yes, it has been done.) 


Chapter 14

Least squares classification

In this chapter we consider the problem of fitting a model to data where the outcome
takes on values like TRUE or FALSE (as opposed to being numbers, as in chapter 13).
We will see that least squares can be used for this problem as well.

14.1 Classification


In the data fitting problem of chapter 13, the goal is to reproduce or predict the 
outcome y, which is a (scalar) number, based on an n-vector x. In a classification 
problem, the outcome or dependent variable y takes on only a finite number of 
values, and for this reason is sometimes called a label, or in statistics, a categorical. 
In the simplest case, y has only two values, for example TRUE or FALSE, or SPAM 
or NOT SPAM. This is called the two-way classification problem, the binary classifi- 
cation problem, or the Boolean classification problem, since the outcome y can take 
on only two values. We start by considering the Boolean classification problem. 

We will encode y as a real number, taking y = +1 to mean TRUE and y = -1
to mean FALSE. (It is also possible to encode the outcomes using y = +1 and
y = 0, or any other pair of two different numbers.) As in real-valued data fitting,
we assume that an approximate relationship of the form $y \approx f(x)$ holds, where
$f : \mathbf{R}^n \to \{-1, +1\}$. (This notation means that the function f takes an n-vector
argument, and gives a resulting value that is either +1 or -1.) Our model will
have the form $\hat y = \hat f(x)$, where $\hat f : \mathbf{R}^n \to \{-1, +1\}$. The model $\hat f$ is also called
a classifier, since it classifies n-vectors into those for which $\hat f(x) = +1$ and those
for which $\hat f(x) = -1$. As in real-valued data fitting, we choose or construct the
classifier $\hat f$ using some observed data.


Examples. Boolean classifiers are widely used in many application areas. 


• Email spam detection. The vector x contains features of an email message.
  It can include word counts in the body of the email message, other features
  such as the number of exclamation points or all-capital words, and features
  related to the origin of the email. The outcome is +1 if the message is SPAM,
  and -1 otherwise. The data used to create the classifier comes from users
  who have explicitly marked some messages as junk.

• Fraud detection. The vector x gives a set of features associated with a credit
  card holder, such as her average monthly spending levels, median price of
  purchases over the last week, number of purchases in different categories,
  average balance, and so on, as well as some features associated with a particular
  proposed transaction. The outcome y is +1 for a fraudulent transaction, and
  -1 otherwise. The data used to create the classifier is taken from historical
  data, that includes (some) examples of transactions that were later verified
  to be fraudulent and (many) that were verified to be bona fide.

• Boolean document classification. The vector x is a word count (or histogram)
  vector for a document, and the outcome y is +1 if the document has some
  specific topic (say, politics) and -1 otherwise. The data used to construct the
  classifier might come from a corpus of documents with their topics labeled.

• Disease detection. The examples correspond to patients, with outcome y = +1
  meaning the patient has a particular disease, and y = -1 meaning they
  do not. The vector x contains relevant medical features associated with the
  patient, including for example age, sex, results of tests, and specific symptoms.
  The data used to build the model come from hospital records or a
  medical study; the outcome is the associated diagnosis (presence or absence
  of the disease), confirmed by a doctor.

• Digital communications receiver. In a modern electronic communications system,
  y represents one bit (traditionally represented by the values 0 and 1)
  that is to be sent from a transmitter to a receiver. The vector x represents
  n measurements of a received signal. The predictor $\hat y = \hat f(x)$ is called the
  decoded bit. In communications, the classifier $\hat f$ is called a decoder or detector.
  The data used to construct the decoder comes from a training signal, a
  sequence of bits known to the receiver, that is transmitted.


Prediction errors. For a given data point x, y, with predicted outcome $\hat y = \hat f(x)$,
there are only four possibilities:

• True positive. $y = +1$ and $\hat y = +1$.

• True negative. $y = -1$ and $\hat y = -1$.

• False positive. $y = -1$ and $\hat y = +1$.

• False negative. $y = +1$ and $\hat y = -1$.

In the first two cases the predicted label is correct, and in the last two cases, the
predicted label is an error. We refer to the third case as a false positive or type I
error, and we refer to the fourth case as a false negative or type II error. In some
applications we care equally about making the two types of errors; in others we
may care more about making one type of error than another.
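
                          Prediction
Outcome     $\hat y = +1$                          $\hat y = -1$                          Total
$y = +1$    $N_{\mathrm{tp}}$                      $N_{\mathrm{fn}}$                      $N_{\mathrm{p}}$
$y = -1$    $N_{\mathrm{fp}}$                      $N_{\mathrm{tn}}$                      $N_{\mathrm{n}}$
All         $N_{\mathrm{tp}} + N_{\mathrm{fp}}$    $N_{\mathrm{fn}} + N_{\mathrm{tn}}$    $N$

Table 14.1 Confusion matrix. The off-diagonal entries $N_{\mathrm{fn}}$ and $N_{\mathrm{fp}}$ give the
numbers of the two types of error.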


Error rate and confusion matrix. For a given data set

$$x^{(1)}, \ldots, x^{(N)}, \qquad y^{(1)}, \ldots, y^{(N)},$$

and model $\hat f$, we can count the numbers of each of the four possibilities that occur
across the data set, and display them in a contingency table or confusion matrix,
which is a 2 x 2 table with the columns corresponding to the value of $\hat y^{(i)}$ and the
rows corresponding to the value of $y^{(i)}$. (This is the convention used in machine
learning; in statistics, the rows and columns are sometimes reversed.) The entries
give the total number of each of the four cases listed above, as shown in table 14.1.
The diagonal entries correspond to correct decisions, with the upper left entry the
number of true positives, and the lower right entry the number of true negatives.
The off-diagonal entries correspond to errors, with the upper right entry the number
of false negatives, and the lower left entry the number of false positives. The total
of the four numbers is N, the number of examples in the data set. Sometimes the
totals of the rows and columns are shown, as in table 14.1.
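
Counting the four cases is a one-liner each once the outcomes and predictions are stored as arrays. The following is a minimal sketch; the function name `confusion_matrix` and the sample labels are made up for illustration, and the layout follows the convention of table 14.1 (rows are outcomes, columns are predictions, +1 first).

```python
import numpy as np

def confusion_matrix(y, y_hat):
    """2 x 2 confusion matrix for labels encoded as +1 / -1.

    Rows are the true outcome (+1 first), columns the prediction (+1 first).
    """
    y = np.asarray(y)
    y_hat = np.asarray(y_hat)
    n_tp = int(np.sum((y == 1) & (y_hat == 1)))
    n_fn = int(np.sum((y == 1) & (y_hat == -1)))
    n_fp = int(np.sum((y == -1) & (y_hat == 1)))
    n_tn = int(np.sum((y == -1) & (y_hat == -1)))
    return np.array([[n_tp, n_fn], [n_fp, n_tn]])

# Made-up outcomes and predictions, for illustration only.
y     = [+1, +1, +1, -1, -1, -1, -1, -1]
y_hat = [+1, -1, +1, -1, +1, -1, -1, -1]
C = confusion_matrix(y, y_hat)
```

Here `C` has two true positives, one false negative, one false positive, and four true negatives, and its entries sum to the number of examples.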

Various performance metrics are expressed in terms of the numbers in the confusion
matrix.

• The error rate is the total number of errors (of both kinds) divided by the
  total number of examples, i.e., $(N_{\mathrm{fp}} + N_{\mathrm{fn}})/N$.

• The true positive rate (also known as the sensitivity or recall rate) is $N_{\mathrm{tp}}/N_{\mathrm{p}}$.
  This gives the fraction of the data points with $y = +1$ for which we correctly
  guessed $\hat y = +1$.

• The false positive rate (also known as the false alarm rate) is $N_{\mathrm{fp}}/N_{\mathrm{n}}$. The
  false positive rate is the fraction of data points with $y = -1$ for which we
  incorrectly guess $\hat y = +1$.

• The specificity or true negative rate is one minus the false positive rate, i.e.,
  $N_{\mathrm{tn}}/N_{\mathrm{n}}$. The true negative rate is the fraction of the data points with $y = -1$
  for which we correctly guess $\hat y = -1$.

• The precision is $N_{\mathrm{tp}}/(N_{\mathrm{tp}} + N_{\mathrm{fp}})$, the fraction of true predictions that are
  correct.
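
The metrics in this list follow directly from the four counts. As a small sketch (the function name and dictionary keys are our own choices, not notation from the text), here they are evaluated on the counts of the spam detector in table 14.2:

```python
def classification_metrics(n_tp, n_fn, n_fp, n_tn):
    """Performance metrics from the four confusion-matrix counts."""
    n_p = n_tp + n_fn          # number of examples with y = +1
    n_n = n_fp + n_tn          # number of examples with y = -1
    n = n_p + n_n              # total number of examples
    return {
        "error_rate": (n_fp + n_fn) / n,
        "true_positive_rate": n_tp / n_p,
        "false_positive_rate": n_fp / n_n,
        "true_negative_rate": n_tn / n_n,
        "precision": n_tp / (n_tp + n_fp),
    }

# Counts taken from the spam detector confusion matrix (table 14.2).
m = classification_metrics(n_tp=95, n_fn=32, n_fp=19, n_tn=1120)
```

The true negative rate and false positive rate returned by this function always sum to one, which matches the definition of specificity above.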

                                    Prediction
Outcome               $\hat y = +1$ (SPAM)    $\hat y = -1$ (not SPAM)    Total
y = +1 (SPAM)                   95                       32                127
y = -1 (not SPAM)               19                     1120               1139
All                            114                     1152               1266

Table 14.2 Confusion matrix of a SPAM detector on a data set of 1266 examples.

A good classifier will have small (near zero) error rate and false positive rate, and
high (near one) true positive rate, true negative rate, and precision. Which of these
metrics is more important depends on the particular application.

An example confusion matrix is given in table 14.2 for the performance of a
spam detector on a data set of N = 1266 examples (emails) of which 127 are SPAM
(y = +1) and the remaining 1139 are NOT SPAM (y = -1). On the data set,
this classifier has 95 true positives, 1120 true negatives, 19 false positives, and 32
false negatives. Its error rate is (19 + 32)/1266 = 4.03%. Its true positive rate
is 95/127 = 74.8% (meaning it is detecting around 75% of the spam in the data
set), and its false positive rate is 19/1139 = 1.67% (meaning it incorrectly labeled
around 1.7% of the non-spam messages as spam).

Validation in classification problems. In classification problems we are concerned
with the error, true positive, and false positive rates. So out-of-sample validation
and cross-validation are carried out using the performance metric or metrics that
we care about, i.e., the error rate or some combination of true positive and false
negative rates. We may care more about one of these metrics than the others.

14.2 Least squares classifier


Many sophisticated methods have been developed for constructing a Boolean model 
or classifier from a data set. Logistic regression and support vector machine are 
two methods that are widely used, but beyond the scope of this book. Here we 
discuss a very simple method, based on least squares, that can work quite well, 
though not as well as the more sophisticated methods. 

We first carry out ordinary real-valued least squares fitting of the outcome,
ignoring for the moment that the outcome y takes on only the values -1 and +1.
We choose basis functions $f_1, \ldots, f_p$, and then choose the parameters $\theta_1, \ldots, \theta_p$ so
as to minimize the sum squared error

$$(y^{(1)} - \tilde f(x^{(1)}))^2 + \cdots + (y^{(N)} - \tilde f(x^{(N)}))^2,$$

where $\tilde f(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x)$. We use the notation $\tilde f$, since this function
is not our final model $\hat f$. The function $\tilde f$ is the least squares fit over our data set,
and $\tilde f(x)$, for a general vector x, is a number.

Our final classifier is then taken to be

$$\hat f(x) = \mathrm{sign}(\tilde f(x)), \qquad (14.1)$$

where $\mathrm{sign}(a) = +1$ for $a \ge 0$ and $-1$ for $a < 0$. We call this classifier the least
squares classifier.
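
The two steps (real-valued least squares fit, then taking the sign) can be sketched as follows. This is an illustration under our own assumptions: the data are synthetic, the basis functions are a constant and the two raw features, and `np.where` is used instead of `np.sign` because `np.sign(0)` returns 0 rather than the +1 required by the definition of sign in (14.1).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data (an assumption, not data from the text):
# class +1 is centered at (1, 1), class -1 at (-1, -1).
N = 200
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)
X = y[:, None] * np.ones((N, 2)) + rng.standard_normal((N, 2))

# Basis functions f_1(x) = 1, f_2(x) = x_1, f_3(x) = x_2.
A = np.column_stack([np.ones(N), X])

# Ordinary least squares fit of the +1 / -1 outcomes.
theta = np.linalg.lstsq(A, y, rcond=None)[0]

def f_tilde(X_new):
    """Real-valued least squares fit, evaluated row by row."""
    return np.column_stack([np.ones(len(X_new)), X_new]) @ theta

def f_hat(X_new):
    """Classifier (14.1): +1 where f_tilde >= 0, -1 otherwise."""
    return np.where(f_tilde(X_new) >= 0, 1.0, -1.0)

# Fraction of training examples that are misclassified.
error_rate = np.mean(f_hat(X) != y)
```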

The intuition behind the least squares classifier is simple. The value $\tilde f(x^{(i)})$ is a
number, which (ideally) is near +1 when $y^{(i)} = +1$, and near -1 when $y^{(i)} = -1$.
If we are forced to guess one of the two possible outcomes +1 or -1, it is natural to
choose $\mathrm{sign}(\tilde f(x^{(i)}))$. (Indeed, $\mathrm{sign}(\tilde f(x^{(i)}))$ is the nearest neighbor of $\tilde f(x^{(i)})$ among the
points -1 and +1.) Intuition suggests that the number $\tilde f(x^{(i)})$ can be related to our
confidence in our guess $\hat y^{(i)} = \mathrm{sign}(\tilde f(x^{(i)}))$: When $\tilde f(x^{(i)})$ is near 1 we have confidence
in our guess $\hat y^{(i)} = +1$; when it is small and negative (say, $\tilde f(x^{(i)}) = -0.03$), we guess
$\hat y^{(i)} = -1$, but our confidence in the guess will be low. We won't pursue this idea
further in this book, except in multi-class classifiers, which we discuss in §14.3.

The least squares classifier is often used with a regression model, i.e., $\tilde f(x) = x^T \beta + v$,
in which case the classifier has the form

$$\hat f(x) = \mathrm{sign}(x^T \beta + v). \qquad (14.2)$$

We can easily interpret the coefficients in this model. For example, if $\beta_7$ is negative,
it means that the larger the value of $x_7$ is, the more likely we are to guess $\hat y = -1$.
If $\beta_4$ is the coefficient with the largest magnitude, then we can say that $x_4$ is the
feature that contributes the most to our classification decision.


Iris flower classification

We illustrate least squares classification with a famous data set, first used in the
1930s by the statistician Ronald Fisher. The data are measurements of four attributes
of three types of iris flowers: Iris Setosa, Iris Versicolour, and Iris Virginica.
The data set contains 50 examples of each class. The four attributes are:

• $x_1$ is the sepal length in cm,

• $x_2$ is the sepal width in cm,

• $x_3$ is the petal length in cm,

• $x_4$ is the petal width in cm.

We compute a Boolean classifier of the form (14.2) that distinguishes the class Iris
Virginica from the other two classes. Using the entire set of 150 examples we find
the coefficients

$$v = -2.39, \quad \beta_1 = -0.0918, \quad \beta_2 = 0.406, \quad \beta_3 = 0.00798, \quad \beta_4 = 1.10.$$

The confusion matrix associated with this classifier is shown in table 14.3. The
error rate is 7.3%.
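
To see how a classifier of the form (14.2) is applied, the sketch below evaluates it with the coefficients reported above on two flowers. The measurements are hypothetical values picked for illustration (one flower with long, wide petals and one with short, narrow petals); they are not rows quoted from the text.

```python
import numpy as np

# Coefficients reported in the text for the Virginica-vs-rest classifier.
v = -2.39
beta = np.array([-0.0918, 0.406, 0.00798, 1.10])

# Hypothetical measurements in cm: sepal length, sepal width,
# petal length, petal width.
flowers = np.array([
    [6.3, 3.3, 6.0, 2.5],   # long, wide petals
    [5.1, 3.5, 1.4, 0.2],   # short, narrow petals
])

# Real-valued fit x^T beta + v, then the sign as in (14.2).
f_tilde = flowers @ beta + v
y_hat = np.where(f_tilde >= 0, 1, -1)
```

With these coefficients the petal width term dominates: the first flower is assigned to class +1 (Iris Virginica) and the second to class -1.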

                  Prediction
Outcome     $\hat y = +1$    $\hat y = -1$    Total
y = +1            46               4            50
y = -1             7              93           100
All               53              97           150

Table 14.3 Confusion matrix for a Boolean classifier of the Iris data set.

                       Model parameters                        Error rate (%)
Fold      v       $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$    Train    Test
1       -2.45     0.0240     0.264     -0.00571     0.994       6.7      3.3
2       -2.38    -0.0657     0.3898    -0.07593     1.251       5.8     10.0
3       -2.63     0.0340     0.3826    -0.08869     1.189       7.5      3.3
4       -1.89    -0.3338     0.577      0.09902     1.151       6.7     16.7
5       -2.42    -0.1464     0.456      0.11200     0.944       8.3      3.3

Table 14.4 Five-fold validation for the Boolean classifier of the Iris data set.


Validation. To test our least squares classification method, we apply 5-fold cross- 
validation. We randomly divide the data set into 5 folds of 30 examples (10 for 
each class). The results are shown in table 14.4. The test data sets contain only 
30 examples, so a single prediction error changes the test error rate significantly 
(i.e., by 3.3%). This explains what would seem to be large variation seen in the 
test set error rates. We might guess that the classifier will perform on new unseen 
data with an error rate in the 7-10% range, but our test sets are not large enough 
to predict future performance more accurately than this. (This is an example of 
the limitation of cross-validation when the data set is small; see the discussion on 
page 268.) 
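
The cross-validation procedure itself is short to write down. The sketch below uses synthetic data of the same size (150 examples, 4 features, 50 in class +1) rather than the Iris measurements, and it splits the data at random without balancing the classes within each fold, which is a simplification of the split described above. The helper names `fit` and `error_rate` are our own.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for a small data set; not the Iris measurements.
N, n, K = 150, 4, 5
y = np.where(np.arange(N) < 50, 1.0, -1.0)
X = rng.standard_normal((N, n)) + 0.8 * y[:, None]

def fit(X, y):
    """Least squares fit of the regression model x^T beta + v."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def error_rate(theta, X, y):
    """Error rate of the classifier sign(x^T beta + v) on (X, y)."""
    f_tilde = np.column_stack([np.ones(len(X)), X]) @ theta
    y_hat = np.where(f_tilde >= 0, 1.0, -1.0)
    return np.mean(y_hat != y)

# Random split of the example indices into K folds of equal size.
folds = np.array_split(rng.permutation(N), K)

results = []   # (train error rate, test error rate) for each fold
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    theta = fit(X[train], y[train])
    results.append((error_rate(theta, X[train], y[train]),
                    error_rate(theta, X[test], y[test])))
```

Because each test fold has 30 examples, every test error rate in `results` is a multiple of 1/30, which is the granularity effect discussed above.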


Handwritten digit classification 


We now consider a much larger example, the MNIST data set described in §4.4.1. 
The (training) data set contains 60000 images of size 28 by 28. (A few samples are 
shown in figure 4.6.) The number of examples per digit varies between 5421 (for 
digit five) and 6742 (for digit one). The pixel intensities are scaled to lie between 0 
and 1. We remove the pixels that are nonzero in fewer than 600 training examples. 
The remaining 493 pixels are shown as the white area in figure 14.1. There is also 
a separate test set containing 10000 images. Here we will consider classifiers to 
distinguish the digit zero from the other nine digits. 

In this first experiment, we use the 493 pixel intensities, plus an additional 
feature with value 1, as the n = 494 features in the least squares classifier (14.1). 

Figure 14.1 Location of the pixels used as features in the handwritten digit
classification example. 


                        Prediction
Outcome     $\hat y = +1$   $\hat y = -1$   Total

$y = +1$       5158             765          5923
$y = -1$        167           53910         54077
All            5325           54675         60000


Table 14.5 Confusion matrix for a classifier for recognizing the digit zero, 
on a training set of 60000 examples. 


The performance on the (training) data set is shown in the confusion matrix in 
table 14.5. The error rate is 1.6%, the true positive rate is 87.1%, and the false 
positive rate is 0.3%. 
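
The three rates quoted above follow directly from the four entries of table 14.5. A minimal sketch of the arithmetic in Python (the function name `boolean_rates` is ours, chosen for illustration):

```python
def boolean_rates(tp, fn, fp, tn):
    """Return (error rate, true positive rate, false positive rate)."""
    total = tp + fn + fp + tn
    error_rate = (fn + fp) / total
    true_positive_rate = tp / (tp + fn)      # fraction of y = +1 predicted +1
    false_positive_rate = fp / (fp + tn)     # fraction of y = -1 predicted +1
    return error_rate, true_positive_rate, false_positive_rate

# Entries of table 14.5 (rows: actual outcome, columns: prediction).
error, tpr, fpr = boolean_rates(tp=5158, fn=765, fp=167, tn=53910)
print(f"error {100 * error:.1f}%, TPR {100 * tpr:.1f}%, FPR {100 * fpr:.1f}%")
```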

Figure 14.2 shows the distribution of the values of $\tilde f(x^{(i)})$ for the two classes in
the training set. The interval $[-2.1, 2.1]$ is divided into 100 intervals of equal width.
For each interval, the height of the blue bar is the fraction of the total number of
training examples $x^{(i)}$ from class $+1$ (digit zero) that have a value $\tilde f(x^{(i)})$ in the
interval. The height of the red bar is the fraction of the total number of training
examples from class $-1$ (digits 1-9) with $\tilde f(x^{(i)})$ in the interval. The vertical dashed
line shows the decision boundary: for $\tilde f(x^{(i)})$ to the left of the line (i.e., negative) we guess
that digit $i$ is from class $-1$, i.e., digits 1-9; for $\tilde f(x^{(i)})$ to the right of the dashed
line, we guess that digit $i$ is from class $+1$, i.e., digit 0. False positives correspond
to red bars to the right of the dashed line, and false negatives correspond to blue
bars to the left of the line.

Figure 14.3 shows the values of the coefficients $\beta_k$, displayed as an image. We
can interpret this image as a map of the sensitivity of our classifier to the pixel

Figure 14.2 The distribution of the values of $\tilde f(x^{(i)})$ in the Boolean classifier (14.1)
for recognizing the digit zero, over all elements $x^{(i)}$ of the training
set. The red bars correspond to the digits from class $-1$, i.e., the digits 1-9;
the blue bars correspond to the digits from class $+1$, i.e., the digit zero.

Figure 14.3 The coefficients $\beta_k$ in the least squares classifier that distinguishes
the digit zero from the other nine digits.


                        Prediction
Outcome     $\hat y = +1$   $\hat y = -1$   Total

$y = +1$        864             116           980
$y = -1$         42            8978          9020
All             906            9094         10000


Table 14.6 Confusion matrix for the classifier for recognizing the digit zero,
on a test set of 10000 examples. 


values. Pixels with $\beta_k = 0$ are not used at all; pixels with larger positive values of
$\beta_k$ are locations where the larger the image pixel value, the more likely we are to
guess that the image represents the digit zero.


Validation. The performance of the least squares classifier on the test set is shown 
in the confusion matrix in table 14.6. For the test set the error rate is 1.6%, the 
true positive rate is 88.2%, and the false positive rate is 0.5%. These performance 
metrics are similar to those for the training data, which suggests that our classifier 
is not over-fit, and gives us some confidence in our classifier. 


Feature engineering. We now do some simple feature engineering (as in §13.3)
to improve our classifier. As described on page 273, we add 5000 new features
to the original 494 features, as follows. We first generate a $5000 \times 494$ matrix
$R$, with randomly chosen entries $\pm 1$. The 5000 new functions are then given by
$\max\{0, (Rx)_j\}$, for $j = 1, \ldots, 5000$. After the addition of the 5000 new features (so

                      Prediction                                               Prediction
Outcome    $\hat y = +1$  $\hat y = -1$   Total          Outcome    $\hat y = +1$  $\hat y = -1$   Total

$y = +1$      5813           110          5923           $y = +1$       963            17           980
$y = -1$        15         54062         54077           $y = -1$         7          9013          9020
All           5828         54172         60000           All            970          9030         10000


Table 14.7 Confusion matrices for the Boolean classifier to recognize the 
digit zero after addition of 5000 new features. The table on the left is for 
the training set. The table on the right is for the test set. 


the total number is 5494), we get the confusion matrices for the training and test 
data sets shown in table 14.7. The error rates are consistent, and equal to 0.21% 
for the training set and 0.24% for the test set, a very substantial improvement 
compared to the 1.6% in the first experiment. A comparison of the distributions in 
figures 14.4 and 14.2 also shows how much better the new classifier distinguishes 
between the two classes of the training set. We conclude that this was a successful 
exercise in feature engineering. 
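
The construction of the new features can be sketched as follows. This is a toy-sized illustration, assuming plain Python lists for vectors; the helper names `random_sign_matrix` and `augment`, the seed, and the tiny dimensions are ours (the experiment in the text uses a $5000 \times 494$ matrix $R$).

```python
import random

def random_sign_matrix(rows, cols, seed=0):
    """Matrix with independent random entries +1 or -1."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def augment(x, R):
    """Append the features max{0, (Rx)_j} to the feature vector x."""
    new_features = [max(0.0, sum(r * xi for r, xi in zip(row, x))) for row in R]
    return list(x) + new_features

# Tiny stand-in sizes for illustration only.
R = random_sign_matrix(rows=5, cols=3)
x = [1.0, 0.2, 0.7]          # first entry plays the role of the constant feature 1
x_aug = augment(x, R)        # 3 original features followed by 5 new ones
```

The least squares classifier is then fit to the augmented feature vectors exactly as before; only the number of columns of the data matrix changes.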


Receiver operating characteristic 


One useful modification of the least squares classifier (14.1) is to skew the decision
boundary, by subtracting a constant $\alpha$ from $\tilde f(x)$ before taking the sign:


$$\hat f(x) = \operatorname{sign}(\tilde f(x) - \alpha). \tag{14.3}$$


The classifier is then


$$\hat f(x) = \begin{cases} +1 & \tilde f(x) \geq \alpha \\ -1 & \tilde f(x) < \alpha. \end{cases}$$


We call $\alpha$ the decision threshold for the modified classifier. The basic least squares
classifier (14.1) has decision threshold $\alpha = 0$.

By choosing $\alpha$ positive, we make the guess $\hat f(x) = +1$ less frequently, so the
numbers in the first column of the confusion matrix go down, and the numbers in
the second column go up (since the sum of the numbers in each row is always the
same). This means that choosing $\alpha$ positive decreases the true positive rate (which
is bad), but it also decreases the false positive rate (which is good). Choosing $\alpha$
negative has the opposite effect, increasing the true positive rate (which is good)
and increasing the false positive rate (which is bad). The parameter $\alpha$ is chosen
depending on how much we care about the two competing metrics, in the particular
application.
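
The effect of the threshold on the confusion matrix can be seen on a small made-up example. The sketch below assumes a list of continuous predictions and their true labels; the data values and the function name `confusion_matrix` are hypothetical, chosen only to illustrate the column shift described above.

```python
def confusion_matrix(values, labels, alpha=0.0):
    """2 x 2 confusion matrix [[TP, FN], [FP, TN]] for threshold alpha."""
    matrix = [[0, 0], [0, 0]]
    for v, y in zip(values, labels):
        prediction = 1 if v >= alpha else -1       # the rule in (14.3)
        row = 0 if y == 1 else 1                   # rows: actual outcome
        col = 0 if prediction == 1 else 1          # columns: prediction
        matrix[row][col] += 1
    return matrix

# Made-up continuous predictions and true labels, for illustration only.
values = [1.2, 0.6, 0.1, -0.2, 0.3, -0.4, -0.9, -1.1]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

basic = confusion_matrix(values, labels, alpha=0.0)    # [[3, 1], [1, 3]]
skewed = confusion_matrix(values, labels, alpha=0.5)   # [[2, 2], [0, 4]]
```

Raising the threshold from 0 to 0.5 moves counts from the first column to the second, while each row sum stays at 4: the true positive rate drops from 3/4 to 2/4, and the false positive rate drops from 1/4 to 0.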

By sweeping $\alpha$ over a range, we obtain a family of classifiers that vary in their
true positive and false positive rates. We can plot the false positive and negative
rates, as well as the error rate, as a function of $\alpha$. A more common way to plot this
data has the strange name receiver operating characteristic (ROC). The ROC shows

Figure 14.4 The distribution of the values of $\tilde f(x^{(i)})$ in the Boolean classifier (14.1)
for recognizing the digit zero, after addition of 5000 new features.


the true positive rate on the vertical axis and false positive rate on the horizontal
axis. The name comes from radar systems deployed during World War II, where
$y = +1$ means that an enemy vehicle (or ship or airplane) is present, and $\hat y = +1$
means that an enemy vehicle is detected.


Example. We examine the skewed threshold least squares classifier (14.3) for the
example described above, where we attempt to detect whether or not a handwritten
digit is zero. Figure 14.5 shows how the error, true positive, and false positive rates
depend on the decision threshold $\alpha$, for the training set data. We can see that as
$\alpha$ increases, the true positive rate decreases, as does the false positive rate. We
can see that for this particular case the total error rate is minimized by choosing
$\alpha = -0.1$, which gives error rate 1.4%, slightly lower than the basic least squares
classifier. The limiting cases when $\alpha$ is negative enough, or positive enough, are
readily understood. When $\alpha$ is very negative, the prediction is always $\hat y = +1$; our
error rate is then the fraction of the data set with $y = -1$. When $\alpha$ is very positive,
the prediction is always $\hat y = -1$, which gives an error rate equal to the fraction of
the data set with $y = +1$.

The same information (without the total error rate) is plotted in the traditional
ROC curve shown in figure 14.6. The dots show the basic least squares classifier,
with $\alpha = 0$, and the skewed threshold least squares classifiers for $\alpha = -0.25$ and
$\alpha = 0.25$. These curves are for the training data; the same curves for the test
data look similar, giving us some confidence that our classifiers will have similar
performance on new, unseen data.
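
The points of an ROC curve are obtained by sweeping the threshold and recording the two rates at each value. A minimal sketch, assuming made-up continuous predictions and labels (the data, the threshold grid, and the name `roc_points` are illustrative assumptions, not the MNIST experiment):

```python
def roc_points(values, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    points = []
    for alpha in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if v >= alpha and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= alpha and y == -1)
        points.append((fp / num_neg, tp / num_pos))
    return points

# Made-up continuous predictions and true labels, for illustration only.
values = [1.2, 0.6, 0.1, -0.2, 0.3, -0.4, -0.9, -1.1]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

thresholds = [-1.5 + 0.25 * k for k in range(13)]   # -1.5, -1.25, ..., 1.5
roc = roc_points(values, labels, thresholds)
```

The first point (very negative threshold) is the corner where both rates equal 1, and the last point (very positive threshold) is the corner where both equal 0, matching the limiting cases discussed above.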

Figure 14.5 True positive, false positive, and total error rate versus decision
threshold $\alpha$. The vertical dashed line is shown for decision threshold $\alpha = 0.25$.

Figure 14.6 ROC curve.


Multi-class classifiers 


In a multi-class classification problem, we have K > 2 possible labels. This is 
sometimes referred to more compactly as K-class classification. (The case K = 2 
is Boolean classification, discussed above.) For our generic discussion of multi-class 
classifiers, we will encode the labels as y = 1,2,..., K. In some applications there 
are more natural encodings; for example, the Likert scale labels Strongly Disagree, 
Disagree, Neutral, Agree, and Strongly Agree are typically encoded as —2, —1, 0,1, 2, 
respectively. 

A multi-class classifier is a function $\hat f : \mathbf{R}^n \to \{1, \ldots, K\}$. Given a feature vector
$x$, $\hat f(x)$ (which is an integer between 1 and $K$) is our prediction of the associated
outcome. A multi-class classifier classifies $n$-vectors into $K$ groups, corresponding
to the values $1, \ldots, K$.


Examples. Multi-class classifiers are used in many application areas. 


• Handwritten digit classification. We are given an image of a handwritten
digit (and possibly other features generated from the images), and wish to
guess which of ten digits it represents. This classifier is used to do automatic
(computer-based) reading of handwritten digits.


• Marketing demographic classification. Data from purchases made, or web
sites visited, is used to train a multi-class classifier for a set of market segments,
such as college-educated women aged 25-30, men without college degrees
aged 45-55, and so on. This classifier guesses the demographic segment
of a new customer, based only on their purchase history. This can be used to
select which promotions to offer a customer for whom we only have purchase
data. The classifier is trained using data from known customers.


• Disease diagnosis. The labels are a set of diseases (including one label that
corresponds to disease-free), and the features are medically relevant values,
such as patient attributes and the results of tests. Such a classifier carries
out diagnosis (of the diseases corresponding to the labels). The classifier is
trained on cases in which a definitive diagnosis has been made.


• Translation word choice. A machine translation system translates a word in
the source language to one of several possible words in the target language.
The label corresponds to a particular choice of translation for the word in the
source language. The features contain information about the context around
the word, for example, word counts or occurrences in the same paragraph.
As an example, the English word ‘bank’ might be translated into another
language one way if the word ‘river’ appears nearby, and another way if
‘financial’ or ‘reserves’ appears nearby. The classifier is trained on data taken
from translations carried out by (human) experts.


• Document topic prediction. Each example corresponds to a document or
article, with the feature vector containing word counts or histograms, and
the label corresponding to the topic or category, such as POLITICS, SPORTS,
ENTERTAINMENT, and so on.


• Detection in communications. Many electronic communications systems transmit
messages as a sequence of $K$ possible symbols. The vector $x$ contains
measurements of the received signal. In this context the classifier $\hat f$ is called
a detector or decoder; the goal is to correctly determine which of the $K$
symbols was transmitted.


Prediction errors and confusion matrix. For a multi-class classifier $\hat f$ and a given
data point $(x, y)$, with predicted outcome $\hat y = \hat f(x)$, there are $K^2$ possibilities,
corresponding to all the pairs of values of $y$, the actual outcome, and $\hat y$, the predicted
outcome. For a given data set (training or validation set) with $N$ elements, the
numbers of each of the $K^2$ occurrences are arranged into a $K \times K$ confusion matrix,
where $N_{ij}$ is the number of data points for which $y = i$ and $\hat y = j$.

The $K$ diagonal entries $N_{11}, \ldots, N_{KK}$ correspond to the cases when the prediction
is correct; the $K^2 - K$ off-diagonal entries $N_{ij}$, $i \neq j$, correspond to prediction
errors. For each $i$, $N_{ii}$ is the number of data points with label $i$ for which we
correctly guessed $\hat y = i$. For $i \neq j$, $N_{ij}$ is the number of data points for which we have
mistaken label $i$ (its true value) for the label $j$ (our incorrect guess). For $K = 2$
(Boolean classification) there are only two types of prediction errors, false positive
and false negative. For $K > 2$ the situation is more complicated, since there are
many more types of errors a predictor can make. From the entries of the confusion
matrix we can derive various measures of the accuracy of the predictions. We let
$N_i$ (with one index) denote the total number of data points for which $y = i$, i.e.,
$N_i = N_{i1} + \cdots + N_{iK}$. We have $N = N_1 + \cdots + N_K$.

The simplest measure is the overall error rate, which is the total number of 
errors (the sum of all off-diagonal entries in the confusion matrix) divided by the 

                     $\hat y$
$y$         Dislike   Neutral   Like

Dislike       183        10       5
Neutral         7        61       8
Like            3        13     210

Table 14.8 Example confusion matrix of a multi-class classifier with three 
classes. 


data set size (the sum of all entries in the confusion matrix):


$$(1/N) \sum_{i \neq j} N_{ij} = 1 - (1/N) \sum_{i} N_{ii}.$$

This measure implicitly assumes that all errors are equally bad. In many applica- 
tions this is not the case; for example, some medical mis-diagnoses might be worse 
for a patient than others. 

We can also look at the rate with which we predict each label correctly. The
quantity $N_{ii}/N_i$ is called the true label $i$ rate. It is the fraction of data points with
label $y = i$ for which we correctly predicted $\hat y = i$. (The true label $i$ rates reduce
to the true positive and true negative rates for Boolean classifiers.)

A simple example, with K = 3 labels (Dislike, Neutral, and Like), and a total 
number N = 500 data points, is shown in table 14.8. Out of 500 data points, 454 
(the sum of the diagonal entries) were classified correctly. The remaining 46 data 
points (the sum of the off-diagonal entries) correspond to the 6 different types of 
errors. The overall error rate is 46/500 = 9.2%. The true label Dislike rate is 
183/(183 + 10 +5) = 92.4%, i.e., among the data points with label Dislike, we 
correctly predicted the label on 92.4% of the data. The true label Neutral rate is 
61/(7+61+8) = 80.3%, and the true label Like rate is 210/(3+13+210) = 92.9%. 
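
The numbers in this example can be reproduced directly from the entries of table 14.8. A short sketch in Python (variable names are ours):

```python
labels = ["Dislike", "Neutral", "Like"]

# Rows: actual label y; columns: predicted label (table 14.8).
N = [[183, 10, 5],
     [7, 61, 8],
     [3, 13, 210]]

total = sum(sum(row) for row in N)                    # data set size
correct = sum(N[i][i] for i in range(len(N)))         # sum of diagonal entries
error_rate = (total - correct) / total                # off-diagonal sum / total

# True label i rate: diagonal entry divided by the sum of row i.
true_label_rates = {labels[i]: N[i][i] / sum(N[i]) for i in range(len(N))}

for name, rate in true_label_rates.items():
    print(f"true label {name} rate: {100 * rate:.1f}%")
print(f"overall error rate: {100 * error_rate:.1f}%")
```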


Least squares multi-class classifier 


The idea behind the least squares Boolean classifier can be extended to handle 
multi-class classification problems. For each possible label value, we construct a 
new data set with the Boolean label +1 if the label has the given value, and —1 
otherwise. (This is sometimes called a one-versus-others or one-versus-all classi- 
fier.) From these K Boolean classifiers we must create a classifier that chooses 
one of the K possible labels. We do this by selecting the label for which the least 
squares regression fit has the highest value, which roughly speaking is the one with 
the highest level of confidence. Our classifier is then 

$$\hat f(x) = \operatorname*{argmax}_{k=1,\ldots,K} \tilde f_k(x),$$

where $\tilde f_k$ is the least squares regression model for label $k$ against the others. The
notation argmax means the index of the largest value among the numbers $\tilde f_k(x)$,
for $k = 1, \ldots, K$. Note that $\tilde f_k(x)$ is the real-valued prediction for the Boolean
classifier for class $k$ versus not class $k$; it is not the Boolean classifier, which is
$\operatorname{sign}(\tilde f_k(x))$.

As an example consider a multi-class classification problem with 3 labels. We 
construct 3 different least squares classifiers, for 1 versus 2 or 3, for 2 versus 1 or 
3, and for 3 versus 1 or 2. Suppose for a given feature vector x, we find that 


$$\tilde f_1(x) = -0.7, \qquad \tilde f_2(x) = +0.2, \qquad \tilde f_3(x) = +0.8.$$


The largest of these three numbers is $\tilde f_3(x)$, so our prediction is $\hat f(x) = 3$. We can
interpret these numbers and our final decision. The first classifier is fairly confident
that the label is not 1. According to the second classifier, the label could be 2, 
but it does not have high confidence in this prediction. Finally, the third classifier 
predicts the label is 3, and moreover has relatively high confidence in this guess. 
So our final guess is label 3. (This interpretation suggests that if we had to make a 
second guess, it should be label 2.) Of course here we are anthropomorphizing the 
individual label classifiers, since they do not have beliefs or levels of confidence in 
their predictions. But the story is helpful in understanding the motivation behind 
the classifier above. 
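
The selection rule in this example amounts to sorting the labels by their continuous predictions. A minimal sketch, using the three values from the example above (the function name `rank_labels` is hypothetical):

```python
def rank_labels(scores):
    """Labels 1..K ordered from highest to lowest continuous prediction."""
    return sorted(range(1, len(scores) + 1),
                  key=lambda k: scores[k - 1], reverse=True)

# Values of the three one-versus-others regression models from the example.
scores = [-0.7, 0.2, 0.8]

ranking = rank_labels(scores)
prediction = ranking[0]       # label with the largest value
second_guess = ranking[1]     # next largest value
```

Here `prediction` is 3 and `second_guess` is 2, in agreement with the interpretation given in the text.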


Skewed decisions. In a Boolean classifier we can skew the decision threshold (see
§14.2.3) to trade off the true positive and false positive rates. In a $K$-class classifier,
an analogous method can be used to trade off the $K$ true label $i$ rates. We apply
an offset $\alpha_k$ to $\tilde f_k(x)$ before finding the largest value. This gives the predictor


$$\hat f(x) = \operatorname*{argmax}_{k=1,\ldots,K} \left( \tilde f_k(x) - \alpha_k \right),$$


where $\alpha_k$ are constants chosen to trade off the true label $k$ rates. If we decrease $\alpha_k$,
we predict $\hat f(x) = k$ more often, so all entries of the $k$th column in the confusion
matrix increase. This increases our rate of true positives for label $k$ (since $N_{kk}$
increases), which is good. But it can decrease the true positive rates for the other
labels.
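
A small sketch of the skewed rule, reusing the three continuous predictions from the earlier 3-label example. The offset values and the function name `skewed_argmax` are made up for illustration.

```python
def skewed_argmax(scores, offsets):
    """Label (1..K) maximizing the continuous prediction minus its offset."""
    adjusted = [s - a for s, a in zip(scores, offsets)]
    return 1 + max(range(len(adjusted)), key=lambda k: adjusted[k])

scores = [-0.7, 0.2, 0.8]

no_skew = skewed_argmax(scores, [0.0, 0.0, 0.0])    # plain argmax: label 3
favor_2 = skewed_argmax(scores, [0.0, -1.0, 0.0])   # 0.2 + 1.0 = 1.2 beats 0.8: label 2
```

Decreasing the offset for label 2 makes the classifier choose label 2 for this feature vector, at the expense of label 3.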


Complexity. In least squares multi-class classification, we solve $K$ least squares
problems, each with $N$ rows and $p$ variables. The naive method of computing
$\hat\theta_1, \ldots, \hat\theta_K$, the coefficients in our one-versus-others classifiers, costs $2KNp^2$ flops.
But the $K$ least squares problems we solve all involve the same matrix; only the
right-hand side vector changes. This means that we can carry out the QR factorization
just once, and use it for computing all $K$ classifier coefficients. Alternatively,
we can say that finding the coefficients of all the one-versus-others classifiers can
be done by solving a matrix least squares problem (see page 233). When $K$ (the
number of classes or labels) is small compared to $p$ (the number of basis functions
or coefficients in the classifier), the cost is about the same as solving just one least
squares problem.
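
As a rough check of this claim, we can compare the two flop counts. The sketch below assumes the commonly quoted costs of the QR-based method, about $2Np^2$ flops for the factorization and about $2Np + p^2$ flops per right-hand side (one matrix-vector product and one back substitution); these per-step counts are an assumption of this sketch rather than something derived in this section.

```latex
\underbrace{2KNp^2}_{\text{naive: } K \text{ separate factorizations}}
\qquad \text{versus} \qquad
\underbrace{2Np^2 + K\,(2Np + p^2)}_{\text{one factorization, } K \text{ right-hand sides}}
```

Under these assumptions, and with $p \leq N$, the extra term satisfies $K(2Np + p^2) \leq 3KNp$, which is small compared to $2Np^2$ when $K$ is small compared to $p$.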

Another simplification in K-class least squares classification arises due to the 
special form of the right-hand sides in the K least squares problems to be solved. 
The right-hand sides in these K problems are Boolean vectors with entries +1 for 

                          Prediction
Class          Setosa   Versicolour   Virginica   Total

Setosa           40          0            0         40
Versicolour       0         27           13         40
Virginica         0          4           36         40
All              40         31           49        120


Table 14.9 Confusion matrix for a 3-class classifier of the Iris data set, on a 
training set of 120 examples. 


one of the classes and $-1$ for all others. It follows that the sum of these $K$ right-hand
sides is the vector with all entries equal to $2 - K$, i.e., $(2 - K)\mathbf{1}$. Since the
mapping from the right-hand sides to the least squares approximate solutions $\hat\theta_k$ is
linear (see page 229), we have $\hat\theta_1 + \cdots + \hat\theta_K = (2 - K)a$, where $a$ is the least squares
approximate solution when the right-hand side is $\mathbf{1}$. Assuming that the first basis
function is $f_1(x) = 1$, we have $a = e_1$. So we have


$$\hat\theta_1 + \cdots + \hat\theta_K = (2 - K) e_1,$$


where $\hat\theta_k$ is the coefficient vector for distinguishing class $k$ from the others. Once
we have computed $\hat\theta_1, \ldots, \hat\theta_{K-1}$, we can find $\hat\theta_K$ by simple vector subtraction.
This explains why for the Boolean classification case we have $K = 2$, but we
only have to solve one least squares problem. In §14.2 we compute one coefficient
vector $\hat\theta$; if the same problem were to be considered a $K$-class problem with $K = 2$,
we would have $\hat\theta_1 = \hat\theta$. (This one distinguishes class 1 versus class 2.) The other
coefficient vector is then $\hat\theta_2 = -\hat\theta_1$. (This one distinguishes class 2 versus class 1.)
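
The identity for the sum of the coefficient vectors can be checked numerically on a tiny made-up problem. The sketch below assumes a single scalar feature with basis functions $1$ and $x$, so each least squares problem has two coefficients and can be solved from its $2 \times 2$ normal equations in closed form; the data and the helper name `least_squares_line` are hypothetical.

```python
def least_squares_line(xs, ys):
    """Least squares fit of theta_1 + theta_2 * x via the 2 x 2 normal equations."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    theta_1 = (sxx * sy - sx * sxy) / det
    theta_2 = (n * sxy - sx * sy) / det
    return theta_1, theta_2

# Made-up data: one scalar feature, labels in {1, 2, 3}.
xs = [0.1, 0.4, 0.5, 1.1, 1.3, 2.0, 2.2, 2.9]
labels = [1, 1, 2, 2, 1, 3, 3, 2]
K = 3

# One-versus-others fits: right-hand side is +1 for class k, -1 otherwise.
thetas = []
for k in range(1, K + 1):
    ys = [1.0 if label == k else -1.0 for label in labels]
    thetas.append(least_squares_line(xs, ys))

# The coefficient vectors should sum to (2 - K) e_1 = (-1, 0), up to rounding.
total = (sum(t[0] for t in thetas), sum(t[1] for t in thetas))

# So the last vector can be recovered from the others by subtraction.
theta_K = ((2 - K) - thetas[0][0] - thetas[1][0], -thetas[0][1] - thetas[1][1])
```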


Iris flower classification 


We compute a 3-class classifier for the Iris data set described on page 289. The 
examples are randomly partitioned into a training set of size 120, containing 40 
examples of each class, and a test set of size 30, with 10 examples of each class. 
The 3 x 3 confusion matrix for the training set is given in table 14.9. The error 
rate is 14.2%. The results for the test set are in table 14.10. The error rate is 
13.3%, similar enough to the training error rate to give us some confidence in our 
classifier. The true Setosa rate is 100% for both train and test sets, suggesting 
that our classifier can detect this type well. The true Versicolour rate is 67.5% for 
the training data, and 60% for the test set. The true Virginica rate is 90% for the 
training data, and 100% for the test set. This suggests that our classifier can detect 
Virginica well, but perhaps not as well as Setosa. (The 100% true Virginica rate 
on the test set is a matter of luck, due to the very small number of test examples 
of each type; see the discussion on page 268.) 

                          Prediction
Class          Setosa   Versicolour   Virginica   Total

Setosa           10          0            0         10
Versicolour       0          6            4         10
Virginica         0          0           10         10
All              10          6           14         30


Table 14.10 Confusion matrix for a 3-class classifier of the Iris data set, on 
a test set of 30 examples. 


Handwritten digit classification 


We illustrate the least squares multi-class classification method by applying it to 
the MNIST data set. For each of the ten digits 0,...,9 (which we encode as 
k =1,...,10) we compute a least squares Boolean classifier 


$$\hat f_k(x) = \operatorname{sign}(x^T \beta_k + v_k),$$


to distinguish digit $k$ from the other digits. The ten Boolean classifiers are combined
into a multi-class classifier


$$\hat f(x) = \operatorname*{argmax}_{k=1,\ldots,10} \left( x^T \beta_k + v_k \right).$$

The $10 \times 10$ confusion matrices for the training set and the test set are given in
tables 14.11 and 14.12.

The error rate on the training set is 14.5%; on the test set it is 13.9%. The true 
label rates on the test set range from 73.5% for digit 5 to 97.5% for digit 1. Many of 
the entries of the confusion matrix make sense. From the first row of the matrix, we 
see a handwritten 0 was rarely mistakenly classified as a 1, 2, or 9; presumably these 
digits look different enough that they are easily distinguished. The most common 
error (80) corresponds to $y = 9$, $\hat y = 4$, i.e., mistakenly identifying a handwritten
9 as a 4. This makes sense since these two digits can look very similar. 
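
The quantities quoted in this paragraph can be recomputed from the test-set confusion matrix. The sketch below copies the entries of table 14.12 into a Python list (row and column index 0 to 9 is the digit itself); the variable names are ours.

```python
# Test-set confusion matrix from table 14.12.
# Rows: true digit 0..9; columns: predicted digit 0..9.
C = [
    [944, 0, 1, 2, 2, 8, 13, 2, 7, 1],
    [0, 1107, 2, 2, 3, 1, 5, 1, 14, 0],
    [18, 54, 815, 26, 16, 0, 38, 22, 39, 4],
    [4, 18, 22, 884, 5, 16, 10, 22, 20, 9],
    [0, 22, 6, 0, 883, 3, 9, 1, 12, 46],
    [24, 19, 3, 74, 24, 656, 24, 13, 38, 17],
    [17, 9, 10, 0, 22, 17, 876, 0, 7, 0],
    [5, 43, 14, 6, 25, 1, 1, 883, 1, 49],
    [14, 48, 11, 31, 26, 40, 17, 13, 756, 18],
    [16, 10, 3, 17, 80, 0, 1, 75, 4, 803],
]

total = sum(map(sum, C))
errors = total - sum(C[d][d] for d in range(10))
error_rate = errors / total

# True label rate for each digit: diagonal entry over the row sum.
true_label_rates = [C[d][d] / sum(C[d]) for d in range(10)]
worst_digit = min(range(10), key=lambda d: true_label_rates[d])
best_digit = max(range(10), key=lambda d: true_label_rates[d])

# Most common error: the largest off-diagonal entry.
count, actual, predicted = max(
    (C[i][j], i, j) for i in range(10) for j in range(10) if i != j
)
```

The largest off-diagonal entry is the one in row 9, column 4, i.e., a handwritten 9 classified as a 4.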


Feature engineering. After adding the 5000 randomly generated new features (as 
described on page 293), the training set error is reduced to about 1.5%, and the 
test set error to 2.6%. The confusion matrices are given in tables 14.13 and 14.14. 
Since we have (substantially) reduced the error in the test set, we conclude that 
adding these 5000 new features was a successful exercise in feature engineering. 

You could reasonably wonder how much performance improvement is possible 
for this example, using feature engineering. For the handwritten digit data set, 
humans have an error rate around 2% (with the true digits verified by checking 
actual addresses, ZIP codes, and so on). Further feature engineering (i.e., introduc- 
ing even more additional random features, or using neural network features) brings 
the error rate down well below 2%, i.e., well below human ability. This should give 
you some idea of how powerful the ideas in this book are. 

Prediction
Digit 0 1 2 3 4 5 6 7 8 9 Total 
0 5669 8 21 19 25 46 65 4 60 6 5923 
1 2 6543 36 17 20 30 14 14 60 6 6742 
2 99 278 4757 153 116 17 234 92 190 22 5958 
3 38 172 174 5150 31 122 59 122 135 128 6131 
4 13 104 41 5 5189 52 45 24 60 309 5842 
5 164 94 30 448 103 3974 185 44 237 142 5421 
6 104 78 77 2 64 106 5448 0 36 3 5918 
7 55 191 36 48 165 9 4 5443 13 301 6265 
8 69 492 64 225 102 220 64 21 4417 177 5851 
9 67 66 26 115 365 12 4 513 39 4742 5949 
All 6280 8026 5262 6182 6180 4588 6122 6277 5247 5836 60000 

Table 14.11 Confusion matrix for least squares multi-class classification of
handwritten digits (training set).


Prediction
Digit 0 1 2 3 4 5 6 7 8 9 Total
0 944 0 1 2 2 8 13 2 7 1 980 
1 0 1107 2 2 3 1 5 1 14 0 1135 
2 18 54 815 26 16 0 38 22 39 4 1032 
3 4 18 22 884 5 16 10 22 20 9 1010 
4 0 22 6 0 883 3 9 1 12 46 982 
5 24 19 3 74 24 656 24 13 38 17 892 
6 17 9 10 0 22 17 876 0 7 0 958
7 5 43 14 6 25 1 1 883 1 49 1028 
8 14 48 11 31 26 40 17 13 756 18 974 
9 16 10 3 17 80 0 1 75 4 803 1009 
All 1042 1330 887 1042 1086 742 994 1032 898 947 10000 


Table 14.12 Confusion matrix for least squares multi-class classification of 
handwritten digits (test set). 

Prediction
Digit 0 1 2 3 4 5 6 7 8 9 Total 
0 5888 1 2 1 3 2 10 0 14 2 5923 
1 1 6679 27 6 11 0 0 10 6 2 6742 
2 11 7 5866 6 12 0 3 22 26 5 5958 
3 1 4 31 5988 0 27 0 24 34 22 6131 
4 1 15 3 0 5748 1 13 4 5 52 5842 
5 6 2 4 26 7 5335 23 2 9 7 5421 
6 8 5 0 0 3 15 5875 0 11 1 5918 
7 3 25 23 4 8 0 1 6159 5 37 6265 
8 5 16 11 12 9 17 11 7 5749 14 5851 
9 10 5 1 29 41 16 2 35 25 5785 5949 


All 5934 6759 5968 6072 5842 5413 5938 6263 5884 5927 60000 


Table 14.13 Confusion matrix for least squares multi-class classification of 
handwritten digits, after addition of 5000 features (training set). 


Prediction 
Digit 0 1 2 3 4 5 6 7 8 9 Total
0 972 0 0 2 0 1 1 1 3 0 980 
1 0 1126 3 1 1 0 3 0 1 0 1135 
2 6 0 998 3 2 0 4 7 11 1 1032 
3 0 0 3 977 0 13 0 5 8 4 1010 
4 2 1 3 0 953 0 6 3 1 13 982 
5 2 0 1 5 0 875 5 0 3 1 892 
6 8 3 0 0 4 6 933 0 4 0 958 
7 0 8 12 0 2 0 1 992 3 10 1028 
8 3 1 3 6 4 3 2 2 946 4 974 
9 4 3 1 12 11 7 1 3 3 964 1009 
All 997 1142 1024 1006 977 905 956 1013 983 997 10000 


Table 14.14 Confusion matrix for least squares multi-class classification of 
handwritten digits, after addition of 5000 features (test set). 

Exercises


Chebyshev bound. Let $\tilde f(x)$ denote the continuous prediction of the Boolean outcome $y$,
and $\hat f(x) = \operatorname{sign}(\tilde f(x))$ the actual classifier. Let $\sigma$ denote the RMS error in the continuous
prediction over some set of data, i.e.,


$$\sigma^2 = \frac{(\tilde f(x^{(1)}) - y^{(1)})^2 + \cdots + (\tilde f(x^{(N)}) - y^{(N)})^2}{N}.$$


Use the Chebyshev bound to argue that the error rate over this data set, i.e., the fraction
of data points for which $\hat f(x^{(i)}) \neq y^{(i)}$, is no more than $\sigma^2$, assuming $\sigma < 1$.


Remark. This bound on the error rate is usually quite bad; that is, the actual error rate 
is often much lower than this bound. But it does show that if the continuous prediction
is good, then the classifier must perform well. 


Interpreting the parameters in a regression classifier. Consider a classifier of the form
$\hat y = \operatorname{sign}(x^T \beta + v)$, where $\hat y$ is the prediction, the $n$-vector $x$ is the feature vector, and the
$n$-vector $\beta$ and scalar $v$ are the classifier parameters. We will assume here that $v \neq 0$ and
$\beta \neq 0$. Evidently $\hat y = \operatorname{sign}(v)$ is the prediction when the feature vector $x$ is zero. Show
that when $\|x\| < |v| / \|\beta\|$, we have $\hat y = \operatorname{sign}(v)$. This shows that when all the features
are small (i.e., near zero), the classifier makes the prediction $\hat y = \operatorname{sign}(v)$. Hint. If two
numbers $a$ and $b$ satisfy $|a| < |b|$, then $\operatorname{sign}(a + b) = \operatorname{sign}(b)$.

This means that we can interpret $\operatorname{sign}(v)$ as the classifier prediction when the features are
small. The ratio $|v| / \|\beta\|$ tells us how big the feature vector must be before the classifier
‘changes its mind’ and predicts $\hat y = -\operatorname{sign}(v)$.


Likert classifier. A response to a question has the options Strongly Disagree, Disagree, 
Neutral, Agree, or Strongly Agree, encoded as —2,—1,0,1,2, respectively. You wish to 
build a multi-class classifier that takes a feature vector x and predicts the response. 
A multi-class least squares classifier builds a separate (continuous) predictor for each 
response versus the others. Suggest a simpler classifier, based on one continuous regression 
model $\tilde f(x)$ that is fit to the numbers that code the responses, using least squares.


Multi-class classifier via matrix least squares. Consider the least squares multi-class
classifier described in §14.3, with a regression model $\tilde f_k(x) = x^T \beta_k$ for the one-versus-others
classifiers. (We assume that the offset term is included using a constant feature.) Show
that the coefficient vectors $\beta_1, \ldots, \beta_K$ can be found by solving the matrix least squares
problem of minimizing $\|X^T \beta - Y\|^2$, where $\beta$ is the $n \times K$ matrix with columns $\beta_1, \ldots, \beta_K$,
and $Y$ is an $N \times K$ matrix.


(a) Give $Y$, i.e., describe its entries. What is the $i$th row of $Y$?


(b) Assuming the rows of $X$ (i.e., the data feature vectors) are linearly independent,
show that the least squares estimate is given by $\hat\beta = (X^T)^\dagger Y$.


List classifier. Consider a multi-class classification problem with K classes. A standard 
multi-class classifier is a function $\hat f$ that returns a class (one of the labels $1, \ldots, K$), when
given a feature $n$-vector $x$. We interpret $\hat f(x)$ as the classifier’s guess of the class that
corresponds to x. A list classifier returns a list of guesses, typically in order from ‘most 
likely’ to ‘least likely’. For example, for a specific feature vector x, a list classifier might 
return 3,6, 2, meaning (roughly) that its top guess is class 3, its next guess is class 6, and 
its third guess is class 2. (The lists can have different lengths for different values of the 
feature vector.) How would you modify the least squares multi-class classifier described 
in §14.3.1 to create a list classifier? Remark. List classifiers are widely used in electronic 
communication systems, where the feature vector x is the received signal, and the class 
corresponds to which of K messages was sent. In this context they are called list decoders. 
List decoders produce a list of probable or likely messages, and allow a later processing 
stage to make the final decision or guess. 
Polynomial classifier with one feature. Generate 200 points $x^{(1)}, \ldots, x^{(200)}$, uniformly
spaced in the interval $[-1, 1]$, and take


$$y^{(i)} = \begin{cases} +1 & -0.5 \leq x^{(i)} < 0.1 \text{ or } 0.5 \leq x^{(i)} \\ -1 & \text{otherwise} \end{cases}$$


for $i = 1, \ldots, 200$. Fit polynomial least squares classifiers of degrees $0, \ldots, 8$ to this
training data set.


(a) Evaluate the error rate on the training data set. Does the error rate decrease when
you increase the degree?


(b) For each degree, plot the polynomial $\tilde f(x)$ and the classifier $\hat f(x) = \operatorname{sign}(\tilde f(x))$.


(c) It is possible to classify this data set perfectly using a classifier $\hat f(x) = \operatorname{sign}(\tilde f(x))$
and a cubic polynomial


$$\tilde f(x) = c(x + 0.5)(x - 0.1)(x - 0.5),$$


for any positive $c$. Compare this classifier with the least squares classifier of degree
3 that you found and explain why there is a difference.


Polynomial classifier with two features. Generate 200 random 2-vectors $x^{(1)}, \ldots, x^{(200)}$ in
a plane, from a standard normal distribution. Define

$$y^{(i)} = \begin{cases} +1 & x_1^{(i)} x_2^{(i)} \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

for $i = 1, \ldots, 200$. In other words, $y^{(i)}$ is $+1$ when $x^{(i)}$ is in the first or third quadrant,
and $-1$ otherwise. Fit a polynomial least squares classifier of degree 2 to the data set,
i.e., use a polynomial


$$\tilde f(x) = \theta_1 + \theta_2 x_1 + \theta_3 x_2 + \theta_4 x_1^2 + \theta_5 x_1 x_2 + \theta_6 x_2^2.$$


Give the error rate of the classifier. Show the regions in the plane where $\hat f(x) = 1$ and
$\hat f(x) = -1$. Also compare the computed coefficients with the polynomial $\tilde f(x) = x_1 x_2$,
which classifies the data points with zero error.


Author attribution. Suppose that the $N$ feature $n$-vectors $x^{(1)}, \ldots, x^{(N)}$ are word count
histograms, and the labels $y^{(1)}, \ldots, y^{(N)}$ give the document authors (as one of $1, \ldots, K$).
A classifier guesses which of the $K$ authors wrote an unseen document, which is called
author attribution. A least squares classifier using regression is fit to the data, resulting
in the classifier

$$\hat f(x) = \operatorname*{argmax}_{k=1,\ldots,K} \left( x^T \beta_k + v_k \right).$$

For each author (i.e., $k = 1, \ldots, K$) we find the ten largest (most positive) entries in the
$n$-vector $\beta_k$ and the ten smallest (most negative) entries. These correspond to two sets
of ten words in the dictionary, for each author. Interpret these words, briefly, in English.


Nearest neighbor interpretation of multi-class classifier. We consider the least squares
K-class classifier of §14.3.1. We associate with each data point the n-vector x, and the
label or class, which is one of 1,..., K. If the class of the data point is k, we associate
it with a K-vector y, whose entries are $y_k = +1$ and $y_j = -1$ for $j \ne k$. (We can
write this vector as $y = 2e_k - \mathbf{1}$.) Define $\tilde y = (\tilde f_1(x), \ldots, \tilde f_K(x))$, which is our
(real-valued or continuous) prediction of the label y. Our multi-class prediction is given by
$\hat f(x) = \operatorname*{argmax}_{k=1,\ldots,K} \tilde f_k(x)$. Show that $\hat f(x)$ is also the index of the nearest neighbor of
$\tilde y$ among the vectors $2e_k - \mathbf{1}$, for k = 1,..., K. In other words, our guess $\hat y$ for the class is
the nearest neighbor of our continuous prediction $\tilde y$, among the vectors that encode the
class labels.
One-versus-one multi-class classifier. In §14.3.1 we construct a K-class classifier from K
Boolean classifiers that attempt to distinguish each class from the others. In this exercise
we describe another method for constructing a K-class classifier. We first develop a
Boolean classifier for every pair of classes i and j, i < j. There are K(K - 1)/2 such
pairs of classifiers, called one-versus-one classifiers. Given a feature vector x, we let $\hat y_{ij}$
be the prediction of the i-versus-j classifier, with $\hat y_{ij} = 1$ meaning that the one-versus-one
classifier is guessing that y = i. We consider $\hat y_{ij} = 1$ as one ‘vote’ for class i, and $\hat y_{ij} = -1$
as one ‘vote’ for class j. We obtain the final estimated class by majority voting: We take
$\hat y$ as the class that has the most votes. (We can break ties in some simple way, like taking
the smallest index that achieves the largest number of votes.)


(a) Construct the least squares classifier, and the one-versus-one classifier, for a multi- 
class (training) data set. Find the confusion matrices, and the error rates, of the 
two classifiers on both the training data set and a separate test data set. 


(b) Compare the complexity of computing the one-versus-one multi-class classifier with
the complexity of the least squares multi-class classifier (see page 300). Assume the
training set contains N/K examples of each class and that N/K is much greater than
the number of features p. Distinguish two methods for the one-versus-one multi-class
classifier. The first, naive, method solves K(K - 1)/2 least squares problems with
N/K rows and p columns. The second, more efficient, method precomputes the
Gram matrices $G_i = A_i^T A_i$ for i = 1,...,K, where the rows of the (N/K) × p
matrix $A_i$ are the training examples for class i, and uses the precomputed Gram
matrices to speed up the solution of the K(K - 1)/2 least squares problems.


Equalizer design from training message. We consider an electronic communication system,
with message to be sent given by an N-vector s, whose entries are $-1$ or 1, and received
signal y, where $y = c * s$, where c is an n-vector, the channel impulse response. The receiver
applies equalization to the received signal, which means that it computes $\tilde y = h * y = h * c * s$,
where h is an n-vector, the equalizer impulse response. The receiver then estimates the
original message using $\hat s = \mathbf{sign}(\tilde y_{1:N})$. This works well if $h * c \approx e_1$. (See exercise 7.15.)
If the channel impulse response c is known or can be measured, we can design or choose
h using least squares, as in exercise 12.6.

In this exercise we explore a method for choosing h directly, without estimating or
measuring c. The sender first sends a message that is known to the receiver, called the training
message, $s^{\mathrm{train}}$. (From the point of view of communications, this is wasted transmission,
and is called overhead.) The receiver receives the signal $y^{\mathrm{train}} = c * s^{\mathrm{train}}$ from the
training message, and then chooses h to minimize $\|(h * y^{\mathrm{train}})_{1:N} - s^{\mathrm{train}}\|^2$. (In practice, this
equalizer is used until the bit error rate increases, which means the channel has changed,
at which point another training message is sent.) Explain how this method is the same as
least squares classification. What are the training data $x^{(i)}$ and $y^{(i)}$? What is the least
squares problem that must be solved to determine the equalizer impulse response h?
Chapter 15


Multi-objective least squares 


In this chapter we consider the problem of choosing a vector that achieves a com- 
promise in making two or more norm squared objectives small. The idea is widely 
used in data fitting, image reconstruction, control, and other applications. 


15.1 Multi-objective least squares


In the basic least squares problem (12.1), we seek the vector $\hat x$ that minimizes
the single objective function $\|Ax - b\|^2$. In some applications we have multiple
objectives, all of which we would like to be small:


\[ J_1 = \|A_1 x - b_1\|^2, \quad \ldots, \quad J_k = \|A_k x - b_k\|^2. \]


Here $A_i$ is an $m_i \times n$ matrix, and $b_i$ is an $m_i$-vector. We can use least squares to
find the x that makes any one of these objectives as small as possible (provided the
associated matrix has linearly independent columns). This will give us (in general)
k different least squares approximate solutions. But we seek a single $\hat x$ that gives
a compromise, and makes them all small, to the extent possible. We call this the
multi-objective (or multi-criterion) least squares problem, and refer to $J_1, \ldots, J_k$
as the k objectives.


Multi-objective least squares via weighted sum. A standard method for finding 
a value of x that gives a compromise in making all the objectives small is to choose 
x to minimize a weighted sum objective: 


\[ J = \lambda_1 J_1 + \cdots + \lambda_k J_k = \lambda_1 \|A_1 x - b_1\|^2 + \cdots + \lambda_k \|A_k x - b_k\|^2, \tag{15.1} \]


where $\lambda_1, \ldots, \lambda_k$ are positive weights, that express our relative desire for the terms
to be small. If we choose all $\lambda_i$ to be one, the weighted sum objective is the sum
of the objective terms; we give each of them equal weight. If $\lambda_2$ is twice as large
as $\lambda_1$, it means that we attach twice as much weight to the objective $J_2$ as to $J_1$.
Roughly speaking, we care twice as strongly that $J_2$ should be small, compared
to our desire that $J_1$ should be small. We will discuss later how to choose these
weights.

Scaling all the weights in the weighted sum objective (15.1) by any positive 
number is the same as scaling the weighted sum objective J by the number, which 
does not change its minimizers. Since we can scale the weights by any positive 
number, it is common to choose $\lambda_1 = 1$. This makes the first objective term $J_1$
our primary objective; we can interpret the other weights as being relative to the
primary objective. 


Weighted sum least squares via stacking. We can minimize the weighted sum 
objective function (15.1) by expressing it as a standard least squares problem. We 
start by expressing J as the norm squared of a single vector: 


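\[
J = \left\| \begin{bmatrix} \sqrt{\lambda_1}\,(A_1 x - b_1) \\ \vdots \\ \sqrt{\lambda_k}\,(A_k x - b_k) \end{bmatrix} \right\|^2,
\]

where we use the property that $\|(a_1, \ldots, a_k)\|^2 = \|a_1\|^2 + \cdots + \|a_k\|^2$ for any vectors
$a_1, \ldots, a_k$. So we have

\[
J = \left\| \begin{bmatrix} \sqrt{\lambda_1}\,A_1 \\ \vdots \\ \sqrt{\lambda_k}\,A_k \end{bmatrix} x
  - \begin{bmatrix} \sqrt{\lambda_1}\,b_1 \\ \vdots \\ \sqrt{\lambda_k}\,b_k \end{bmatrix} \right\|^2
  = \|\tilde A x - \tilde b\|^2,
\]

where $\tilde A$ and $\tilde b$ are the matrix and vector

\[
\tilde A = \begin{bmatrix} \sqrt{\lambda_1}\,A_1 \\ \vdots \\ \sqrt{\lambda_k}\,A_k \end{bmatrix}, \qquad
\tilde b = \begin{bmatrix} \sqrt{\lambda_1}\,b_1 \\ \vdots \\ \sqrt{\lambda_k}\,b_k \end{bmatrix}. \tag{15.2}
\]

The matrix $\tilde A$ is $m \times n$, and the vector $\tilde b$ has length m, where $m = m_1 + \cdots + m_k$.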

We have now reduced the problem of minimizing the weighted sum least squares
objective to a standard least squares problem. Provided the columns of $\tilde A$ are
linearly independent, the minimizer is unique, and given by


\[
\begin{aligned}
\hat x &= (\tilde A^T \tilde A)^{-1} \tilde A^T \tilde b \\
       &= (\lambda_1 A_1^T A_1 + \cdots + \lambda_k A_k^T A_k)^{-1} (\lambda_1 A_1^T b_1 + \cdots + \lambda_k A_k^T b_k).
\end{aligned} \tag{15.3}
\]


This reduces to our standard formula for the solution of a least squares problem
when k = 1 and $\lambda_1 = 1$. (In fact, when k = 1, $\lambda_1$ does not matter.) We can
compute $\hat x$ via the QR factorization of $\tilde A$.
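
The stacking construction can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `weighted_sum_ls`, the random data, and the weights are hypothetical and chosen only for illustration. It solves the stacked problem with `np.linalg.lstsq` (an SVD-based solver, used here in place of an explicit QR factorization) and compares the result with the Gram-matrix form (15.3).

```python
import numpy as np

def weighted_sum_ls(As, bs, lambdas):
    """Minimize sum_i lambdas[i] * ||As[i] @ x - bs[i]||^2 by stacking."""
    # Scale each block by sqrt(lambda_i) and stack, as in (15.2).
    A_tilde = np.vstack([np.sqrt(lam) * A for lam, A in zip(lambdas, As)])
    b_tilde = np.concatenate([np.sqrt(lam) * b for lam, b in zip(lambdas, bs)])
    x_hat, *_ = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)
    return x_hat

# Made-up problem data: two objectives, five variables.
rng = np.random.default_rng(0)
As = [rng.standard_normal((10, 5)), rng.standard_normal((8, 5))]
bs = [rng.standard_normal(10), rng.standard_normal(8)]
lambdas = [1.0, 2.5]

x_stacked = weighted_sum_ls(As, bs, lambdas)

# Same point from the Gram-matrix form (15.3).
G = sum(lam * A.T @ A for lam, A in zip(lambdas, As))
h = sum(lam * A.T @ b for lam, A, b in zip(lambdas, As, bs))
x_formula = np.linalg.solve(G, h)
```

The two vectors `x_stacked` and `x_formula` should agree to within rounding error; the stacked form is usually preferred numerically because it avoids forming the Gram matrix.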


Independent columns of stacked matrix. Our assumption (12.2) that the columns
of $\tilde A$ in (15.2) are linearly independent is not the same as assuming that each of
$A_1, \ldots, A_k$ has linearly independent columns. We can state the condition that $\tilde A$
has linearly independent columns as: There is no nonzero vector x that satisfies
$A_i x = 0$ for i = 1,...,k. This implies that if just one of the matrices $A_1, \ldots, A_k$
has linearly independent columns, then $\tilde A$ does.

The stacked matrix $\tilde A$ can have linearly independent columns even when none
of the matrices $A_1, \ldots, A_k$ do. This can happen when $m_i < n$ for all i, i.e., all
$A_i$ are wide. However, we must have $m_1 + \cdots + m_k \ge n$, since $\tilde A$ must be tall or
square for the linearly independent columns assumption to hold.
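
A tiny illustration of the last point, assuming NumPy and two made-up $1 \times 2$ matrices: each block is wide, so neither has linearly independent columns, yet the stacked matrix does. (Scaling the blocks by positive factors $\sqrt{\lambda_i}$ would not change the ranks.)

```python
import numpy as np

A1 = np.array([[1.0, 0.0]])   # 1 x 2, wide: columns are dependent
A2 = np.array([[0.0, 1.0]])   # 1 x 2, wide: columns are dependent
A_stacked = np.vstack([A1, A2])

# A matrix has linearly independent columns exactly when its rank
# equals its number of columns (here, 2).
rank_A1 = np.linalg.matrix_rank(A1)
rank_A2 = np.linalg.matrix_rank(A2)
rank_stacked = np.linalg.matrix_rank(A_stacked)
```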


Optimal trade-off curve. We start with the special case of two objectives (also 
called the bi-criterion problem), and write the weighted sum objective as 


\[ J = J_1 + \lambda J_2 = \|A_1 x - b_1\|^2 + \lambda \|A_2 x - b_2\|^2, \]


where $\lambda > 0$ is the relative weight put on the second objective, compared to the
first. For small $\lambda$, we care much more about $J_1$ being small than $J_2$ being small;
for large $\lambda$, we care much less about $J_1$ being small than $J_2$ being small.

Let $\hat x(\lambda)$ denote the weighted sum least squares solution $\hat x$ as a function of $\lambda$,
assuming the stacked matrices have linearly independent columns. These points
are called Pareto optimal (after the economist Vilfredo Pareto) which means there
is no point z that satisfies


\[ \|A_1 z - b_1\|^2 \le \|A_1 \hat x(\lambda) - b_1\|^2, \qquad \|A_2 z - b_2\|^2 \le \|A_2 \hat x(\lambda) - b_2\|^2, \]


with one of the inequalities holding strictly. In other words, there is no point z
that is as good as $\hat x(\lambda)$ in one of the objectives, and beats it on the other one. To
see why this is the case, we note that any such z would have a value of J that is
less than that achieved by $\hat x(\lambda)$, which minimizes J, a contradiction.

We can plot the two objectives $\|A_1 \hat x(\lambda) - b_1\|^2$ and $\|A_2 \hat x(\lambda) - b_2\|^2$ against each
other, as $\lambda$ varies over $(0, \infty)$, to understand the trade-off of the two objectives.
This curve is called the optimal trade-off curve of the two objectives. There is no
point z that achieves values of $J_1$ and $J_2$ that lie below and to the left of the
optimal trade-off curve.


Simple example. We consider a simple example with two objectives, with $A_1$ and
$A_2$ both $10 \times 5$ matrices. The entries of the weighted least squares solution $\hat x(\lambda)$
are plotted against $\lambda$ in figure 15.1. On the left, where $\lambda$ is small, $\hat x(\lambda)$ is very close
to the least squares approximate solution for $A_1, b_1$. On the right, where $\lambda$ is large,
$\hat x(\lambda)$ is very close to the least squares approximate solution for $A_2, b_2$. In between
the behavior of $\hat x(\lambda)$ is very interesting; for instance, we can see that $\hat x(\lambda)_3$ first
increases with increasing $\lambda$ before eventually decreasing.

Figure 15.2 shows the values of the two objectives $J_1$ and $J_2$ versus $\lambda$. As
expected, $J_1$ increases as $\lambda$ increases, and $J_2$ decreases as $\lambda$ increases. (It can
be shown that this always holds.) Roughly speaking, as $\lambda$ increases we put more
emphasis on making $J_2$ small, which comes at the expense of making $J_1$ bigger.
The optimal trade-off curve for this bi-criterion problem is plotted in figure 15.3.
The left end-point corresponds to minimizing $\|A_1 x - b_1\|^2$, and the right end-point
corresponds to minimizing $\|A_2 x - b_2\|^2$. We can conclude, for example, that there
is no vector z that achieves $\|A_1 z - b_1\|^2 < 8$ and $\|A_2 z - b_2\|^2 < 5$.
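
A sweep of this kind can be sketched as follows, assuming NumPy. The $10 \times 5$ matrices here are random stand-ins, not the data behind figures 15.1 to 15.3, so the numerical values will differ from those in the figures; only the qualitative behavior ($J_1$ nondecreasing and $J_2$ nonincreasing in $\lambda$) is expected to carry over.

```python
import numpy as np

# Hypothetical problem data with the same dimensions as the example.
rng = np.random.default_rng(1)
A1, b1 = rng.standard_normal((10, 5)), rng.standard_normal(10)
A2, b2 = rng.standard_normal((10, 5)), rng.standard_normal(10)

lambdas = np.logspace(-4, 4, 50)      # logarithmically spaced weights
J1 = np.empty_like(lambdas)
J2 = np.empty_like(lambdas)
for i, lam in enumerate(lambdas):
    # Stack the two objectives, weighting the second by sqrt(lam).
    A_tilde = np.vstack([A1, np.sqrt(lam) * A2])
    b_tilde = np.concatenate([b1, np.sqrt(lam) * b2])
    x_hat = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)[0]
    J1[i] = np.sum((A1 @ x_hat - b1) ** 2)
    J2[i] = np.sum((A2 @ x_hat - b2) ** 2)

# Smallest achievable value of each objective on its own.
x1 = np.linalg.lstsq(A1, b1, rcond=None)[0]
x2 = np.linalg.lstsq(A2, b2, rcond=None)[0]
J1_min = np.sum((A1 @ x1 - b1) ** 2)
J2_min = np.sum((A2 @ x2 - b2) ** 2)
```

Plotting `J2` against `J1` (for example with a plotting library of choice) traces out the optimal trade-off curve for this made-up data.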
Figure 15.1 Weighted-sum least squares solution $\hat x(\lambda)$ as a function of $\lambda$ for
a bi-criterion least squares problem with five variables.

Figure 15.2 Objective functions $J_1 = \|A_1 \hat x(\lambda) - b_1\|^2$ (blue line) and
$J_2 = \|A_2 \hat x(\lambda) - b_2\|^2$ (red line) as functions of $\lambda$ for the bi-criterion problem in
figure 15.1.

Figure 15.3 Optimal trade-off curve for the bi-criterion least squares problem
of figures 15.1 and 15.2.


The steep slope of the optimal trade-off curve near the left end-point means
that we can achieve a substantial reduction in $J_2$ with only a small increase in $J_1$.
The small slope of the optimal trade-off curve near the right end-point means that
we can achieve a substantial reduction in $J_1$ with only a small increase in $J_2$. This
is quite typical, and indeed, is why multi-criterion least squares is useful.


Optimal trade-off surface. Above we described the case with k = 2 objectives.
When we have more than 2 objectives, the interpretation is similar, although it is
harder to plot the objectives, or the values of $\hat x$, versus the weights. For example
with k = 3 objectives, we have two weights, $\lambda_2$ and $\lambda_3$, which give the relative
weight of $J_2$ and $J_3$ compared to $J_1$. Any solution $\hat x(\lambda)$ of the weighted least squares
problem is Pareto optimal, which means that there is no point that achieves values
of $J_1, J_2, J_3$ less than or equal to those obtained by $\hat x(\lambda)$, with strict inequality
holding for at least one of them. As the parameters $\lambda_2$ and $\lambda_3$ vary over $(0, \infty)$,
the values of $J_1, J_2, J_3$ sweep out the optimal trade-off surface.


Using multi-objective least squares. In the rest of this chapter we will see several 
specific applications of multi-objective least squares. Here we give some general 
remarks on how it is used in applications. 

First we identify a primary objective $J_1$ that we would like to be small. The
objective $J_1$ is typically the one that would be used in an ordinary single-objective
least squares approach, such as the mean square error of a model on some training
data, or the mean square deviation from some target or goal.

We also identify one or more secondary objectives $J_2, J_3, \ldots, J_k$, that we would
also like to be small. These secondary objectives are typically generic ones, like 
the desire that some parameters be ‘small’ or ‘smooth’, or close to some previous 
or prior value. In estimation applications these secondary objectives typically cor- 
respond to some kind of prior knowledge or assumption about the vector x that 
we seek. We wish to minimize our primary objective, but are willing to accept an 
increase in it, if this gives a sufficient decrease in the secondary objectives. 

The weights are treated like ‘knobs’ in our method, that we change (‘turn’ or 
‘tune’ or ‘tweak’) to achieve a value of $\hat x$ that we like (or can live with). For given
candidate values of $\lambda$ we evaluate the objectives; if we decide that $J_2$ is larger
than we would like, but we can tolerate a somewhat larger $J_3$, then we increase
$\lambda_2$ and decrease $\lambda_3$, and find $\hat x$ and the associated values of $J_1, J_2, J_3$ using the
new weights. This is repeated until a reasonable trade-off among them has been 
obtained. In some cases we can be principled in how we adjust the weights; for 
example, in data fitting, we can use validation to help guide us in the choice of 
the weights. In many other applications, it comes down to (application-specific) 
judgment or even taste. 

The additional terms $\lambda_2 J_2, \ldots, \lambda_k J_k$ that we add to the primary objective $J_1$
are sometimes called regularization (terms). The secondary objectives are sometimes
described by name, as in ‘least squares fitting with smoothness regularization’.

In exploring the trade-offs among the objectives, the weights are typically varied 
over a wide range of values, by choosing a finite number of values (perhaps ten or 
a few tens) that are logarithmically spaced, as in figures 15.1 and 15.2. This means 
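that for N values of $\lambda$ between $\lambda^{\min}$ and $\lambda^{\max}$, we use the values

\[ \lambda^{\min}, \quad \theta\lambda^{\min}, \quad \theta^2\lambda^{\min}, \quad \ldots, \quad \theta^{N-1}\lambda^{\min} = \lambda^{\max}, \]

with $\theta = (\lambda^{\max}/\lambda^{\min})^{1/(N-1)}$.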
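
As a small worked instance of this grid, assuming NumPy and the illustrative choices $\lambda^{\min} = 10^{-4}$, $\lambda^{\max} = 10^{4}$, and $N = 9$ (values picked here for the example, not taken from the figures), the ratio works out to $\theta = 10$. The same grid is produced by `np.logspace`.

```python
import numpy as np

lam_min, lam_max, N = 1e-4, 1e4, 9   # illustrative values only

# Common ratio between consecutive weights.
theta = (lam_max / lam_min) ** (1.0 / (N - 1))

# lam_min, theta*lam_min, theta^2*lam_min, ..., theta^(N-1)*lam_min
lambdas = lam_min * theta ** np.arange(N)

# Equivalent grid from NumPy's built-in logarithmic spacing.
lambdas_np = np.logspace(np.log10(lam_min), np.log10(lam_max), N)
```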


15.2 Control


In control applications, the goal is to decide on a set of actions or inputs, specified 
by an n-vector x, that achieve some goals. The actions result in some outputs or 
effects, given by an m-vector y. We consider here the case when the inputs and 
outputs are related by an affine model 


y = Ax +b. 


The m x n matrix A and m-vector b characterize the input-output mapping of 
the system. The model parameters A and b are found from analytical models, 
experiments, computer simulations, or fit to past (observed) data. Typically the 
input or action x = 0 has some special meaning. The m-vector b gives the output 
when the input is zero. In many cases the vectors x and y represent deviations of 
the inputs and outputs from some standard values. 
We typically have a desired or target output, denoted by the m-vector $y^{\mathrm{des}}$.
The primary objective is

\[ J_1 = \|Ax + b - y^{\mathrm{des}}\|^2, \]


the norm squared deviation of the output from the desired output. The main 
objective is to choose an action x so that the output is as close as possible to the 
desired value. 

There are many possible secondary objectives. The simplest one is the norm
squared value of the input, $J_2 = \|x\|^2$, so the problem is to optimally trade off
missing the target output (measured by $\|y - y^{\mathrm{des}}\|^2$), and keeping the input small
(measured by $\|x\|^2$).

Another common secondary objective has the form $J_2 = \|x - x^{\mathrm{nom}}\|^2$, where
$x^{\mathrm{nom}}$ is a nominal or standard value for the input. In this case the secondary
objective is to keep the input close to the nominal value. This objective is sometimes
used when x represents a new choice for the input, and $x^{\mathrm{nom}}$ is the current value.
In this case the goal is to get the output near its target, while not changing the
input much from its current value.


Control of heating and cooling. As an example, x could give the vector of n
heating (or cooling) power levels in a commercial building with n air handling
units (with $x_i > 0$ meaning heating and $x_i < 0$ meaning cooling) and y could
represent the resulting temperature at m locations in the building. The matrix
A captures the effect of each of n heating/cooling units on the temperatures in
the building at each of m locations; the vector b gives the temperatures at the m
locations when no heating or cooling is applied. The desired or target output might
be $y^{\mathrm{des}} = T^{\mathrm{des}} \mathbf{1}$, assuming the target temperature is the same at all locations. The
primary objective $\|y - y^{\mathrm{des}}\|^2$ is the sum of squares of the deviations of the location
temperatures from the target temperature. The secondary objective $J_2 = \|x\|^2$,
the norm squared of the vector of heating/cooling powers, would be reasonable,
since it is at least roughly related to the energy cost of the heating and cooling.

We find tentative choices of the input by minimizing $J_1 + \lambda_2 J_2$ for various values
of $\lambda_2$. If for the current value of $\lambda_2$ the heating/cooling powers are larger than we’d
like, we increase $\lambda_2$ and re-compute $\hat x$.
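
The tuning loop just described might look as follows. This is a sketch under stated assumptions: NumPy, a made-up building with m = 6 locations and n = 3 units, and invented entries for A, b, and the target temperature; the helper name `control_input` is hypothetical. It minimizes $\|Ax + b - y^{\mathrm{des}}\|^2 + \lambda_2 \|x\|^2$ by stacking, for a small and a large value of $\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.uniform(0.1, 1.0, size=(m, n))   # invented: temperature rise per unit power
b = rng.uniform(10.0, 15.0, size=m)      # invented: temperatures with no heating/cooling
y_des = 20.0 * np.ones(m)                # same target temperature at all locations

def control_input(lam):
    """Minimize ||A x + b - y_des||^2 + lam * ||x||^2 via a stacked least squares problem."""
    A_tilde = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_tilde = np.concatenate([y_des - b, np.zeros(n)])
    return np.linalg.lstsq(A_tilde, b_tilde, rcond=None)[0]

x_small = control_input(0.01)   # little weight on the power term
x_large = control_input(10.0)   # more weight on the power term

power_small, power_large = np.linalg.norm(x_small), np.linalg.norm(x_large)
miss_small = np.linalg.norm(A @ x_small + b - y_des)
miss_large = np.linalg.norm(A @ x_large + b - y_des)
```

Increasing the weight reduces the power used (`power_large < power_small`) at the cost of a larger deviation from the target temperature (`miss_large > miss_small`).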


Product demand shaping. In demand shaping, we adjust or change the prices of 
a set of n products in order to move the demand for the products towards some 
given target demand vector, perhaps to better match the available supply of the 
products. The standard price elasticity of demand model is $\delta^{\mathrm{dem}} = E^{\mathrm{d}} \delta^{\mathrm{price}}$, where
$\delta^{\mathrm{dem}}$ is the vector of fractional demand changes, $\delta^{\mathrm{price}}$ is the vector of fractional
price changes, and $E^{\mathrm{d}}$ is the price elasticity of demand matrix. (These are all
described on page 150.) In this example the price change vector $\delta^{\mathrm{price}}$ represents
the action that we take; the result is the change in demand, $\delta^{\mathrm{dem}}$. The primary
objective could be


\[ J_1 = \|\delta^{\mathrm{dem}} - \delta^{\mathrm{tar}}\|^2 = \|E^{\mathrm{d}} \delta^{\mathrm{price}} - \delta^{\mathrm{tar}}\|^2, \]


where $\delta^{\mathrm{tar}}$ is the target change in demand.

At the same time, we want the price changes to be small. This suggests the
secondary objective $J_2 = \|\delta^{\mathrm{price}}\|^2$. We then minimize $J_1 + \lambda J_2$ for various values
of $\lambda$, which trades off how close we come to the target change in demand with how
much we change the prices.


Dynamics. The system can also be dynamic, meaning that we take into account
time variation of the input and output. In the simplest case x is the time series of
a scalar input, so $x_i$ is the action taken in period i, and $y_i$ is the (scalar) output in
period i. In this setting, $y^{\mathrm{des}}$ is a desired trajectory for the output. A very common
model for dynamic systems, with x and y representing scalar input and
output time series, is a convolution: $y = h * x$. In this case, A is Toeplitz, and b
represents a time series, which is what the output would be with x = 0.
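
The Toeplitz structure can be checked numerically. The sketch below assumes NumPy; the helper `convolution_matrix` and the values of h and x are made up for illustration. It builds the matrix whose product with x equals the full convolution $h * x$ (of length n + len(h) − 1); a model that keeps only the first n output samples would use the first n rows.

```python
import numpy as np

def convolution_matrix(h, n):
    """Return the (n + len(h) - 1) x n Toeplitz matrix T with T @ x equal to h * x."""
    m = len(h)
    T = np.zeros((n + m - 1, n))
    for j in range(n):
        T[j:j + m, j] = h      # each column is a shifted copy of h
    return T

h = np.array([1.0, 0.5, 0.25])         # made-up impulse response
x = np.array([2.0, -1.0, 0.0, 3.0])    # made-up input time series
A = convolution_matrix(h, len(x))
y = A @ x                              # same as np.convolve(h, x)
```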

As a typical example in this category, the input $x_i$ can represent the torque
applied to the drive wheels of a locomotive (say, over one second intervals), and $y_i$
is the locomotive speed.

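In addition to the usual secondary objective $J_2 = \|x\|^2$, it is common to have
an objective that the input should be smooth, i.e., not vary too rapidly over time.
This is achieved with the objective $\|Dx\|^2$, where D is the $(n-1) \times n$ first difference
matrix


\[
D = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1
\end{bmatrix}. \tag{15.4}
\]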
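
A short sketch of the first difference matrix, assuming NumPy; the helper name `first_difference_matrix` and the sample vector are hypothetical. It builds D for n = 5 and evaluates the smoothness objective $\|Dx\|^2$.

```python
import numpy as np

def first_difference_matrix(n):
    """(n-1) x n matrix D with (D @ x)[i] = x[i+1] - x[i]."""
    return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

D = first_difference_matrix(5)
x = np.array([1.0, 2.0, 4.0, 4.0, 3.0])   # made-up time series

differences = D @ x                  # [1, 2, 0, -1]
roughness = np.sum(differences ** 2)  # 1 + 4 + 0 + 1 = 6
```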


15.3 Estimation and inversion


In the broad application area of estimation (also called inversion), the goal is to 
estimate a set of n values (also called parameters), the entries of the n-vector x. We 
are given a set of m measurements, the entries of an m-vector y. The parameters 
and measurements are related by 


\[ y = Ax + v, \]


where A is a known $m \times n$ matrix, and v is an unknown m-vector. The matrix A
describes how the measured values (i.e., $y_i$) depend on the unknown parameters
(i.e., $x_j$). The m-vector v is the measurement error or measurement noise, and is
unknown but presumed to be small. The estimation problem is to make a sensible 
guess as to what x is, given y (and A), and prior knowledge about x. 

If the measurement noise were zero, and A has linearly independent columns, 
we could recover x exactly, using $x = A^\dagger y$. (This is called exact inversion.) Our
job here is to guess x, even when these strong assumptions do not hold. Of course 
we cannot expect to find x exactly, when the measurement noise is nonzero, or 
when A does not have linearly independent columns. This is called approximate 
inversion, or in some contexts, just inversion. 
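
The difference between exact and approximate inversion can be seen on a small synthetic case. The sketch assumes NumPy; the matrix, the true parameter vector, and the noise level are all invented for the example. With zero noise the pseudo-inverse recovers x exactly (up to rounding); with a small nonzero noise it returns a nearby but different vector.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))     # tall; a random matrix like this has independent columns
x_true = np.array([1.0, -2.0, 0.5])

y_exact = A @ x_true                                 # measurement noise v = 0
y_noisy = y_exact + 0.01 * rng.standard_normal(8)    # small nonzero v

x_exact = np.linalg.pinv(A) @ y_exact    # exact inversion
x_noisy = np.linalg.pinv(A) @ y_noisy    # approximate inversion

err_exact = np.linalg.norm(x_exact - x_true)
err_noisy = np.linalg.norm(x_noisy - x_true)
```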


The matrix A can be wide, square, or tall; the same methods are used to estimate
x in all three cases. When A is wide, we would not have enough measurements
to determine x from y, even without the noise (i.e., with v = 0). In this case we
have to also rely on our prior information about x to make a reasonable guess.
When A is square or tall, we would have enough measurements to determine x, if
there were no noise present. Even in this case, judicious use of multiple-objective
least squares can incorporate our prior knowledge in the estimation, and yield far
better results.


15.3.1 Regularized inversion


If we guess that x has the value $\hat x$, then we are implicitly making the guess that
v has the value $y - A\hat x$. If we assume that smaller values of v (measured by $\|v\|$)
are more plausible than larger values, then a sensible choice for $\hat x$ is the least
squares approximate solution, which minimizes $\|A\hat x - y\|^2$. We will take this as our
primary objective. Our prior information about x enters in one or more secondary 
objectives. Simple examples are listed below. 


• $\|x\|^2$: x should be small. This corresponds to the (prior) assumption that x
is more likely to be small than large.


• $\|x - x^{\mathrm{prior}}\|^2$: x should be near $x^{\mathrm{prior}}$. This corresponds to the assumption
that x is near some known vector $x^{\mathrm{prior}}$.


• $\|Dx\|^2$, where D is the first difference matrix (15.4). This corresponds to
the assumption that x should be smooth, i.e., $x_{i+1}$ should be near $x_i$. This
regularization is often used when x represents a time series.


• The Dirichlet energy $\mathcal{D}(x) = \|A^T x\|^2$, where A is the incidence matrix of
a graph (see page 135). This corresponds to the assumption that x varies
smoothly across the graph, i.e., $x_i$ is near $x_j$ when i and j are connected by
an edge of the graph. When the Dirichlet energy is used as a regularizer, it
is sometimes called Laplacian regularization. (The previous example, $\|Dx\|^2$,
is a special case of Dirichlet energy, for the chain graph.)


Finally, we will choose our estimate $\hat x$ by minimizing

\[ \|Ax - y\|^2 + \lambda_2 J_2(x) + \cdots + \lambda_p J_p(x), \]


where $\lambda_i > 0$ are weights, and $J_2, \ldots, J_p$ are the regularization terms. This is called
regularized inversion or regularized estimation. We may repeat this for several 
choices of the weights, and choose the best estimate for the particular application. 


Tikhonov regularized inversion. Choosing $\hat x$ to minimize

\[ \|Ax - y\|^2 + \lambda \|x\|^2 \]


for some choice of $\lambda > 0$ is called Tikhonov regularized inversion, after the
mathematician Andrey Tikhonov. Here we seek a guess $\hat x$ that is consistent with the
measurements (i.e., $\|A\hat x - y\|^2$ is small), but not too big.
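
A minimal numerical sketch of Tikhonov regularized inversion, assuming NumPy and invented data with a wide A (fewer measurements than unknowns). Written as a stacked least squares problem, the objective has matrix $(A, \sqrt{\lambda} I)$ and right-hand side $(y, 0)$; the sketch also evaluates the closed form $(A^T A + \lambda I)^{-1} A^T y$ for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 6                          # wide A: 4 measurements, 6 unknowns
A = rng.standard_normal((m, n))      # invented measurement matrix
y = rng.standard_normal(m)           # invented measurements
lam = 0.1

# Stacked form: minimize ||A_tilde x - y_tilde||^2.
A_tilde = np.vstack([A, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.zeros(n)])
x_stacked = np.linalg.lstsq(A_tilde, y_tilde, rcond=None)[0]

# Closed form via the regularized Gram matrix.
x_normal = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# The stacked matrix has full column rank even though A itself is wide.
rank_stacked = np.linalg.matrix_rank(A_tilde)
```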
The stacked matrix in this case,

\[ \tilde A = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix}, \]

always has linearly independent columns, without any assumption about A, which
can have any dimensions, and need not have linearly independent columns. To see
this we note that $\tilde A x = (Ax, \sqrt{\lambda}\, x) = 0$ implies that $\sqrt{\lambda}\, x = 0$, which implies x = 0.
The Gram matrix associated with $\tilde A$,

\[ \tilde A^T \tilde A = A^T A + \lambda I, \]

is therefore always invertible (provided $\lambda > 0$). The Tikhonov regularized
approximate solution is then

\[ \hat x = (A^T A + \lambda I)^{-1} A^T y. \]


Equalization. The vector x represents a transmitted signal or message, consisting
of n real values. The matrix A represents the mapping from the transmitted signal
to what is received (called the channel); y = Ax + v includes noise as well as the
action of the channel. Guessing what x is, given y, can be thought of as un-doing
the effects of the channel. In this context, estimation is called equalization.


15.3.2 Estimating a periodic time series


Suppose that the T-vector y is a (measured) time series, that we believe is a noisy 
version of a periodic time series, i.e., one that repeats itself every P periods. We 
might also know or assume that the periodic time series is smooth, i.e., its adjacent 
values are not too far apart. 

Periodicity arises in many time series. For example, we would expect a time 
series of hourly temperature at some location to approximately repeat itself every 
24 hours, or the monthly snowfall in some region to approximately repeat itself 
every 12 months. (Periodicity with a 24 hour period is called diurnal; periodicity 
with a yearly period is called seasonal or annual.) As another example, we might 
expect daily total sales at a restaurant to approximately repeat itself weekly. The 
goal is to get an estimate of Tuesday’s total sales, given some historical daily sales 
data. 

The periodic time series will be represented by a P-vector x, which gives its
values over one period. It corresponds to the full time series


\[ \hat y = (x, x, \ldots, x), \]


which just repeats x, where we assume here for simplicity that T is a multiple of P.
(If this is not the case, the last x is replaced with a slice of the form $x_{1:k}$.) We can
express $\hat y$ as $\hat y = Ax$, where A is the $T \times P$ selector matrix


\[ A = \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix}. \]

Our total square estimation error is $\|Ax - y\|^2$.

We can minimize this objective analytically. The solution $\hat x$ is found by
averaging the values of y associated with the different entries in x. For example,
we estimate Tuesday sales by averaging all the entries in y that correspond to
Tuesdays. (See exercise 15.10.) This simple averaging works well if we have many
periods’ worth of data, i.e., if T/P is large.
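
The averaging property can be confirmed on synthetic data. The sketch assumes NumPy; the weekly pattern, the noise level, and the number of weeks are invented for illustration. It builds the $T \times P$ selector matrix by stacking identity matrices and compares the least squares solution with the per-day averages.

```python
import numpy as np

P, periods = 7, 4                     # weekly period, four weeks of data
T = P * periods
rng = np.random.default_rng(4)
x_true = np.array([5.0, 3.0, 4.0, 6.0, 8.0, 9.0, 7.0])   # invented weekly pattern
y = np.tile(x_true, periods) + 0.5 * rng.standard_normal(T)

# T x P selector matrix: identity blocks stacked on top of each other.
A = np.tile(np.eye(P), (periods, 1))

# Least squares estimate of the periodic component.
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# Direct averaging: entry p is the mean of y over all days with index p in the week.
x_avg = y.reshape(periods, P).mean(axis=0)
```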

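A more sophisticated estimate can be found by adding regularization for x to
be smooth, based on the assumption that

\[ x_1 \approx x_2, \quad \ldots, \quad x_{P-1} \approx x_P, \quad x_P \approx x_1. \]

(Note that we include the ‘wrap-around’ pair $x_P$ and $x_1$ here.) We measure
non-smoothness as $\|D^{\mathrm{circ}} x\|^2$, where $D^{\mathrm{circ}}$ is the $P \times P$ circular difference matrix


\[
D^{\mathrm{circ}} = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1 \\
1 & 0 & 0 & \cdots & 0 & 0 & -1
\end{bmatrix}.
\]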

We estimate the periodic time series by minimizing 
\[ \|Ax - y\|^2 + \lambda \|D^{\mathrm{circ}} x\|^2. \]


For $\lambda = 0$ we recover the simple averaging mentioned above; as $\lambda$ gets bigger, the
estimated signal becomes smoother, ultimately converging to a constant (which is 
the mean of the original time series data). 
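
The two limiting cases can be checked on synthetic data. This is a sketch assuming NumPy; the helper names `circular_difference_matrix` and `periodic_fit`, the sinusoidal pattern, and the noise level are all hypothetical, and T is taken to be a multiple of P. It forms the circular difference matrix, solves the regularized problem by stacking, and compares the fit for a zero weight and for a very large weight.

```python
import numpy as np

def circular_difference_matrix(P):
    """P x P matrix: row i gives x[i+1] - x[i]; the last row wraps around to x[0] - x[P-1]."""
    return np.roll(np.eye(P), 1, axis=1) - np.eye(P)

def periodic_fit(y, P, lam):
    """Minimize ||A x - y||^2 + lam * ||D_circ x||^2, assuming len(y) is a multiple of P."""
    periods = len(y) // P
    A = np.tile(np.eye(P), (periods, 1))
    D = circular_difference_matrix(P)
    A_tilde = np.vstack([A, np.sqrt(lam) * D])
    y_tilde = np.concatenate([y, np.zeros(P)])
    return np.linalg.lstsq(A_tilde, y_tilde, rcond=None)[0]

# Invented data: a noisy daily sinusoid observed for ten days.
rng = np.random.default_rng(6)
P, periods = 24, 10
t = np.arange(P)
y = np.tile(np.sin(2 * np.pi * t / P), periods) + 0.3 * rng.standard_normal(P * periods)

x_0 = periodic_fit(y, P, 0.0)      # no smoothing: per-hour averages
x_big = periodic_fit(y, P, 1e8)    # heavy smoothing: nearly constant
```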

The time series $A\hat x$ is called the extracted seasonal component of the given time
series data y (assuming we are considering yearly variation). Subtracting this from
the original data yields the time series $y - A\hat x$, which is called the seasonally adjusted
time series. 

The parameter $\lambda$ can be chosen using validation. This can be done by selecting
a time interval over which to build the estimate, and another one to validate it. 
For example, with 4 years of data, we might train our model on the first 3 years of 
data, and test it on the last year of data. 


Example. In figure 15.4 we apply this method to a series of hourly ozone
measurements. The top figure shows hourly measurements over a period of 14 days
(July 1-14, 2014). We represent these values by a 336-vector c, with $c_{24(j-1)+i}$,
i = 1,...,24, defined as the hourly values on day j, for j = 1,...,14. As indicated
by the gaps in the graph, a number of measurements are missing from the record
(only 275 of the 336 = 24 × 14 measurements are available). We use the notation
$M_j \subseteq \{1, 2, \ldots, 24\}$ to denote the set containing the indices of the available
measurements on day j. For example, $M_8 = \{1, 2, 3, 4, 6, 7, 8, 23, 24\}$, because on July
8, the measurements at 4AM, and from 8AM to 9PM are missing. The middle and
bottom figures show two periodic time series. The time series are parametrized
by a 24-vector x, repeated 14 times to get the full series (x, x,..., x). The two
estimates of x in the figure were computed by minimizing


\[
\sum_{j=1}^{14} \sum_{i \in M_j} \bigl(x_i - \log(c_{24(j-1)+i})\bigr)^2
+ \lambda \left( \sum_{i=1}^{23} (x_{i+1} - x_i)^2 + (x_1 - x_{24})^2 \right)
\]


for $\lambda = 1$ and $\lambda = 100$.

Figure 15.4 Top. Hourly ozone level at Azusa, California, during the first
14 days of July 2014 (California Environmental Protection Agency, Air
Resources Board, www.arb.ca.gov). Measurements start at 12AM on July 1st,
and end at 11PM on July 14. Note the large number of missing measurements.
In particular, all 4AM measurements are missing. Middle. Smooth
periodic least squares fit to logarithmically transformed measurements, using
$\lambda = 1$. Bottom. Smooth periodic least squares fit using $\lambda = 100$.
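
An objective of this form, with missing measurements, can be handled by keeping only the rows of the selector matrix that correspond to available samples. The sketch below assumes NumPy and uses synthetic data in place of the ozone record: the daily pattern, the noise, and the availability mask are all invented, with every 4AM sample marked missing to mimic the situation described above.

```python
import numpy as np

P, days = 24, 14
rng = np.random.default_rng(7)
hours = np.arange(P)

# Invented stand-in for the logarithm of the hourly measurements.
log_c_true = -3.0 + 0.8 * np.sin(2 * np.pi * (hours - 9) / P)
log_c = np.tile(log_c_true, days) + 0.2 * rng.standard_normal(P * days)

# Invented availability mask; index 4 is 4AM when index 0 is 12AM.
available = rng.random(P * days) > 0.2
available[4::P] = False

# Selector matrix restricted to the available measurements.
A = np.tile(np.eye(P), (days, 1))[available]
D = np.roll(np.eye(P), 1, axis=1) - np.eye(P)   # circular difference matrix

def fit(lam):
    """Minimize the sum of squared errors over available samples plus lam * ||D x||^2."""
    A_tilde = np.vstack([A, np.sqrt(lam) * D])
    y_tilde = np.concatenate([log_c[available], np.zeros(P)])
    return np.linalg.lstsq(A_tilde, y_tilde, rcond=None)[0]

x_1 = fit(1.0)
x_100 = fit(100.0)
```

Although no 4AM sample is available (the corresponding column of `A` is zero), the smoothness term ties the 4AM entry of x to its neighbors, so the stacked matrix still has linearly independent columns and the estimate is well defined.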


15.3.3 Image de-blurring 


The vector x is an image, and the matrix A gives blurring, so y = Ax + v is a blurred,
noisy image. Our prior information about x is that it is smooth; neighboring pixel
values are not very different from each other. Estimation is the problem of guessing
what x is, and is called de-blurring.

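In least squares image deblurring we form an estimate $\hat x$ by minimizing a cost
function of the form


\[ \|Ax - y\|^2 + \lambda (\|D_{\mathrm{h}} x\|^2 + \|D_{\mathrm{v}} x\|^2). \tag{15.5} \]


Here $D_{\mathrm{v}}$ and $D_{\mathrm{h}}$ represent vertical and horizontal differencing operations, and
the role of the second term in the weighted sum is to penalize non-smoothness in
the reconstructed image. Specifically, suppose the vector x has length MN and
contains the pixel intensities of an $M \times N$ image X stored column-wise. Let $D_{\mathrm{h}}$
be the $M(N-1) \times MN$ matrix


\[
D_{\mathrm{h}} = \begin{bmatrix}
-I & I & 0 & \cdots & 0 & 0 & 0 \\
0 & -I & I & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -I & I & 0 \\
0 & 0 & 0 & \cdots & 0 & -I & I
\end{bmatrix},
\]


where all blocks have size $M \times M$, and let $D_{\mathrm{v}}$ be the $(M-1)N \times MN$ matrix


\[
D_{\mathrm{v}} = \begin{bmatrix}
D & 0 & \cdots & 0 \\
0 & D & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & D
\end{bmatrix},
\]


where each of the N diagonal blocks D is an $(M-1) \times M$ difference matrix


\[
D = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1
\end{bmatrix}.
\]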
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Figure 15.5 Left: Blurred, noisy image. Right: Result of regularized least 
squares deblurring with A = 0.007. Image credit: NASA. 


With these definitions the penalty term in (15.5) is the sum of squared differences
of intensities at adjacent pixels in a row or column:

$$
\|D_h x\|^2 + \|D_v x\|^2 = \sum_{i=1}^{M} \sum_{j=1}^{N-1} (X_{i,j+1} - X_{ij})^2 + \sum_{i=1}^{M-1} \sum_{j=1}^{N} (X_{i+1,j} - X_{ij})^2.
$$

This quantity is the Dirichlet energy (see page 135), for the graph that connects
each pixel to its left and right, and up and down, neighbors.
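The two differencing matrices can be written compactly as Kronecker products, which makes the identity above easy to check numerically. This is a small sketch on a 4 × 6 random image; the helper names `difference_matrix` and `image_difference_matrices` are hypothetical, and dense matrices are used only because the example is tiny.

```python
import numpy as np

def difference_matrix(n):
    """(n-1) x n matrix whose ith row is (e_{i+1} - e_i)^T."""
    return np.eye(n, k=1)[:-1] - np.eye(n)[:-1]

def image_difference_matrices(M, N):
    """Horizontal and vertical differencing for an M x N image stored column-wise."""
    D_h = np.kron(difference_matrix(N), np.eye(M))   # M(N-1) x MN, blocks -I, I
    D_v = np.kron(np.eye(N), difference_matrix(M))   # (M-1)N x MN, block diagonal
    return D_h, D_v

M, N = 4, 6
rng = np.random.default_rng(0)
X = rng.random((M, N))
x = X.flatten(order="F")          # stack the columns of X

D_h, D_v = image_difference_matrices(M, N)
penalty = np.sum((D_h @ x) ** 2) + np.sum((D_v @ x) ** 2)

# The same quantity computed directly from adjacent pixel differences.
direct = np.sum((X[:, 1:] - X[:, :-1]) ** 2) + np.sum((X[1:, :] - X[:-1, :]) ** 2)
print(penalty, direct)
```

For an image of realistic size the matrices have very few nonzero entries per row, so a sparse representation (for example via `scipy.sparse.kron`) would be the natural choice.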


Example. In figures 15.5 and 15.6 we illustrate this method for an image of size
512 × 512. The blurred, noisy image is shown in the left part of figure 15.5.
Figure 15.6 shows the estimates $\hat x$, obtained by minimizing (15.5), for four different
values of the parameter $\lambda$. The best result (in this case, judged by eye) is obtained
for $\lambda$ around 0.007 and is shown on the right in figure 15.5.


15.3.4 Tomography


In tomography, the vector x represents the values of some quantity (such as density) 
in a region of interest in n voxels (or pixels) over a 3-D (or 2-D) region. The entries 
of the vector y are measurements obtained by passing a beam of radiation through 
the region of interest, and measuring the intensity of the beam after it exits the 
region. 

A familiar application is the computer-aided tomography (CAT) scan used in 
medicine. In this application, beams of X-rays are sent through a patient, and 
an array of detectors measure the intensity of the beams after passing through 
the patient. These intensity measurements are related to the integral of the X- 
ray absorption along the beam. Tomography is also used in applications such as 
manufacturing, to assess internal damage or certify quality of a welded joint.

Figure 15.6 Deblurred images for $\lambda = 10^{-6}$, $10^{-4}$, $10^{-2}$, $1$. Image credit:
NASA.


Line integral measurements. For simplicity we will assume that each beam is a
single line, and that the received value $y_i$ is the integral of the quantity over the
region, plus some measurement noise. (The same method can be used when more
complex beam shapes are used.) We consider the 2-D case.

Let d(x,y) denote the density (say) at the position (x,y) in the region. (Here
x and y are the scalar 2-D coordinates, not the vectors x and y in the estimation
problem.) We assume that d(x,y) = 0 outside the region of interest. A line through
the region is defined by the set of points

$$
p(t) = (x_0, y_0) + t(\cos\theta, \sin\theta),
$$

where $(x_0, y_0)$ denotes a (base) point on the line, and $\theta$ is the angle of the line with
respect to the x-axis. The parameter t gives the distance along the line from the
point $(x_0, y_0)$. The line integral of d is given by

$$
\int_{-\infty}^{\infty} d(p(t)) \, dt.
$$

We assume that m lines are specified (e.g., by their base points and angles), and
the measurement $y_i$ is the line integral of d, plus some noise, which is presumed
small.

We divide the region of interest into n pixels (or voxels in the 3-D case), and
assume that the density has the constant value $x_i$ over pixel i. Figure 15.7 illustrates
this for a simple example with n = 25 pixels. (In applications the number
of pixels or voxels is in the thousands or millions.) The line integral is then given
by the sum of $x_i$ (the density in pixel i) times the length of the intersection of the
line with pixel i. In figure 15.7, with the pixels numbered row-wise starting at the
top left corner, with width and height one, the line integral for the line shown is

$$
1.06 x_{16} + 0.80 x_{17} + 0.27 x_{12} + 1.06 x_{13} + 1.06 x_{14} + 0.53 x_{15} + 0.54 x_{10}.
$$

The coefficient of $x_i$ is the length of the intersection of the line with pixel i.


Measurement model. We can express the vector of m line integral measurements,
without the noise, as Ax, where the m × n matrix A has entries

$$
A_{ij} = \text{length of line } i \text{ in pixel } j, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n,
$$

with $A_{ij} = 0$ if line i does not intersect voxel j.
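The entries of A can be computed by clipping each line against each pixel. The sketch below is one simple way to do this (the slab method for an axis-aligned box), not necessarily how the matrix was formed for the examples in this section. It assumes a K × K grid of unit pixels covering [0, K] × [0, K], numbered row-wise from the top left as in figure 15.7; the function names are hypothetical.

```python
import numpy as np

def intersection_length(x0, y0, theta, xmin, xmax, ymin, ymax):
    """Length of the part of the line p(t) = (x0, y0) + t (cos theta, sin theta)
    that lies inside the box [xmin, xmax] x [ymin, ymax]."""
    t_lo, t_hi = -np.inf, np.inf
    for p0, d, lo, hi in ((x0, np.cos(theta), xmin, xmax),
                          (y0, np.sin(theta), ymin, ymax)):
        if abs(d) < 1e-12:
            # Line is parallel to this pair of edges: inside for all t, or never.
            if p0 < lo or p0 > hi:
                return 0.0
        else:
            t1, t2 = (lo - p0) / d, (hi - p0) / d
            t_lo = max(t_lo, min(t1, t2))
            t_hi = min(t_hi, max(t1, t2))
    # The direction is a unit vector, so a difference in t is a length.
    return max(0.0, t_hi - t_lo)

def line_integral_matrix(lines, K):
    """A[i, j] = length of line i in pixel j, for lines given as (x0, y0, theta)."""
    A = np.zeros((len(lines), K * K))
    for i, (x0, y0, theta) in enumerate(lines):
        for j in range(K * K):
            row, col = divmod(j, K)          # row 0 is the top row of pixels
            A[i, j] = intersection_length(x0, y0, theta,
                                          col, col + 1, K - 1 - row, K - row)
    return A

# A horizontal line through the bottom row, and the diagonal of the region.
lines = [(0.0, 0.5, 0.0), (0.0, 0.0, np.pi / 4)]
A = line_integral_matrix(lines, K=5)
print(A[0].sum(), A[1].sum())
```

A line that runs exactly along a pixel edge is counted in both adjacent pixels by this sketch; in practice such lines are avoided or handled by a convention.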


Tomographic reconstruction. In tomography, estimation or inversion is often 
called tomographic reconstruction or tomographic inversion. 

The objective term $\|Ax - y\|^2$ is the sum of squares of the residual between the
predicted (noise-free) line integrals Ax and the actual measured line integrals y.
Regularization terms capture prior information or assumptions about the voxel
values, for example, that they vary smoothly over the region. A simple regularizer
commonly used is the Dirichlet energy (see page 135) associated with the graph
that connects each voxel to its 6 neighbors (in the 3-D case) or its 4 neighbors (in
the 2-D case). Using the Dirichlet energy as a regularizer is also called Laplacian
regularization.

Figure 15.7 A square region of interest divided into 25 pixels, and a line
passing through it.


Example. A simple 2-D example is shown in figures 15.8-15.10. Figure 15.8 shows 
the geometry of the m = 4000 lines and the square region, shown as the square. 
The square is divided into 100 x 100 pixels, so n = 10000. 

The density of the object we are imaging is shown in figure 15.9. In this object 
the density of each pixel is either 0 or 1 (shown as white or black, respectively). 
We reconstruct or estimate the object density from the 4000 (noisy) line integral 
measurements by solving the regularized least squares problem 


$$
\text{minimize} \quad \|Ax - y\|^2 + \lambda \|Dx\|^2,
$$

where $\|Dx\|^2$ is the sum of squares of the differences of the pixel values from their
neighbors. Figure 15.10 shows the results for six different values of $\lambda$. We can see
that for small $\lambda$ the reconstruction is relatively sharp, but suffers from noise. For
large $\lambda$ the noise in the reconstruction is smaller, but it is too smooth.
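The reconstruction problem above can be sketched on a much smaller grid. In this illustration the matrix A is a random nonnegative stand-in for the intersection lengths (an assumption made to keep the block self-contained), there are fewer measurements than pixels, and the problem is solved by stacking A on top of the scaled difference matrix. All function and variable names are hypothetical.

```python
import numpy as np

def difference_matrix(n):
    """(n-1) x n matrix whose ith row is (e_{i+1} - e_i)^T."""
    return np.eye(n, k=1)[:-1] - np.eye(n)[:-1]

def grid_difference_matrix(K):
    """Differences between each pixel of a K x K grid and its neighbors
    in both grid directions."""
    D1 = difference_matrix(K)
    return np.vstack([np.kron(D1, np.eye(K)), np.kron(np.eye(K), D1)])

def regularized_reconstruction(A, y, D, lam):
    """Minimize ||A x - y||^2 + lam ||D x||^2 via the stacked least squares problem."""
    A_stacked = np.vstack([A, np.sqrt(lam) * D])
    y_stacked = np.concatenate([y, np.zeros(D.shape[0])])
    x_hat, *_ = np.linalg.lstsq(A_stacked, y_stacked, rcond=None)
    return x_hat

rng = np.random.default_rng(1)
K, m = 6, 20                                  # 36 pixels, only 20 measurements
x_true = np.zeros((K, K))
x_true[2:4, 1:5] = 1.0                        # an object with density 0 or 1
x_true = x_true.flatten()
A = rng.random((m, K * K))                    # stand-in for intersection lengths
y = A @ x_true + 0.01 * rng.standard_normal(m)
D = grid_difference_matrix(K)

lams = [1e-3, 1e-1, 10.0]
recons = [regularized_reconstruction(A, y, D, lam) for lam in lams]
for lam, x_hat in zip(lams, recons):
    print(lam, np.sum((A @ x_hat - y) ** 2), np.sum((D @ x_hat) ** 2))
```

The printed values show the trade-off described above: as the weight grows, the measurement residual grows and the roughness of the estimate shrinks.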


15.4 Regularized data fitting


We consider least squares data fitting, as described in chapter 13. In §13.2 we con- 
sidered the issue of over-fitting, where the model performs poorly on new, unseen 
data, which occurs when the model is too complicated for the given data set. The 
remedy is to keep the model simple, e.g., by fitting with a polynomial of not too 
high a degree. 

Regularization is another way to avoid over-fitting, different from simply choos- 
ing a model that is simple (i.e., does not have too many basis functions). Regular- 
ization is also called de-tuning, shrinkage, or ridge regression, for reasons we will 
explain below.

Figure 15.8 The square region at the center of the picture is surrounded
by 100 points shown as circles. 40 lines (beams) emanate from each point.
(The lines are shown for two points only.) This gives a total of 4000 lines
that intersect the region.

Figure 15.9 Density of object used in the tomography example.

Figure 15.10 Regularized least squares reconstruction for six values of the
regularization parameter.

Motivation. To motivate regularization, consider the model

$$
\hat f(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x). \tag{15.6}
$$

We can interpret $\theta_i$ as the amount by which our prediction depends on $f_i(x)$, so
if $\theta_i$ is large, our prediction will be very sensitive to changes or variations in the
value of $f_i(x)$, such as those we might expect in new, unseen data. This suggests
that we should prefer that $\theta_i$ be small, so our model is not too sensitive. There
is one exception here: if $f_i(x)$ is constant (for example, the number one), then we
should not worry about the size of $\theta_i$, since $f_i(x)$ never varies. But we would like
all the others to be small, if possible.

This suggests the bi-criterion least squares problem with primary objective
$\|y - A\theta\|^2$, the sum of squares of the prediction errors, and secondary objective
$\|\theta_{2:p}\|^2$, assuming that $f_1$ is the constant function one. Thus we should minimize

$$
\|y - A\theta\|^2 + \lambda \|\theta_{2:p}\|^2, \tag{15.7}
$$

where $\lambda > 0$ is called the regularization parameter.
For the regression model, this weighted objective can be expressed as

$$
\|y - X^T \beta - v\mathbf{1}\|^2 + \lambda \|\beta\|^2.
$$

Here we penalize $\beta$ being large (because this leads to sensitivity of the model), but
not the offset v. Choosing $\beta$ to minimize this weighted objective is called ridge
regression.
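The objective (15.7) is an ordinary least squares problem once the penalty is stacked under A. The sketch below assumes the first column of A is all ones, so only the remaining parameters are penalized; the function name `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(A, y, lam):
    """Minimize ||y - A theta||^2 + lam ||theta[1:]||^2, with A[:, 0] all ones."""
    p = A.shape[1]
    B = np.eye(p)[1:]                      # (p-1) x p selector: B @ theta = theta[1:]
    A_stacked = np.vstack([A, np.sqrt(lam) * B])
    y_stacked = np.concatenate([y, np.zeros(p - 1)])
    theta, *_ = np.linalg.lstsq(A_stacked, y_stacked, rcond=None)
    return theta

rng = np.random.default_rng(0)
N, p = 30, 6
A = np.column_stack([np.ones(N), rng.standard_normal((N, p - 1))])
y = 2.0 + A[:, 1:] @ rng.standard_normal(p - 1) + 0.1 * rng.standard_normal(N)

theta_small = ridge_fit(A, y, lam=1e-3)    # close to the unregularized fit
theta_large = ridge_fit(A, y, lam=1e6)     # penalized parameters pushed to zero
print(theta_small)
print(theta_large, y.mean())
```

With a very large weight the penalized parameters are essentially zero and the unpenalized constant term approaches the average of y, which is the best constant fit.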


Effect of regularization. The effect of the regularization is to accept a worse value
of sum square fit ($\|y - A\theta\|^2$) in return for a smaller value of $\|\theta_{2:p}\|^2$, which measures
the size of the parameters (except $\theta_1$, which is associated with the constant basis
function). This explains the name shrinkage: The parameters are smaller than they
would be without regularization, i.e., they are shrunk. The term de-tuned suggests
that with regularization, the model is not excessively ‘tuned’ to the training data
(which would lead to over-fit).


Regularization path. We get a different model for every choice of $\lambda$. The way
the parameters change with $\lambda$ is called the regularization path. When p is small
enough (say, less than 15 or so) the parameter values can be plotted, with $\lambda$ on the
horizontal axis. Usually only 30 or 50 values of $\lambda$ are considered, typically spaced
logarithmically over a large range (see page 314).

An appropriate value of $\lambda$ can be chosen via out-of-sample or cross-validation.
As $\lambda$ increases, the RMS fit on the training data worsens (increases). But (as with
model order) the test set RMS prediction error typically decreases as $\lambda$ increases,
and then, when $\lambda$ gets too big, it increases. A good choice of regularization parameter
is one which approximately minimizes the test set RMS prediction error.
When multiple values of $\lambda$ approximately minimize the RMS error, common practice
is to take the largest value of $\lambda$. The idea here is to use the model of minimum
sensitivity, as measured by $\|\beta\|^2$, among those that make good predictions on the
test set.

Figure 15.11 A signal s(t) and 30 noisy samples. Ten of the samples are
used as training set, the other 20 as test set.


Example. We illustrate these ideas with a small example with synthetic (simulated)
data. We start with a signal, shown in figure 15.11, consisting of a constant
plus four sinusoids:

$$
s(t) = c + \sum_{k=1}^{4} \alpha_k \cos(\omega_k t + \phi_k),
$$

with coefficients

$$
c = 1.54, \quad \alpha_1 = 0.66, \quad \alpha_2 = -0.90, \quad \alpha_3 = -0.66, \quad \alpha_4 = 0.89. \tag{15.8}
$$

(The other parameters are $\omega_1 = 13.69$, $\omega_2 = 3.55$, $\omega_3 = 23.25$, $\omega_4 = 6.03$, and
$\phi_1 = 0.21$, $\phi_2 = 0.02$, $\phi_3 = -1.87$, $\phi_4 = 1.72$.) We will fit a model of the
form (15.6) with p = 5 and

$$
f_1(x) = 1, \quad f_{k+1}(x) = \cos(\omega_k x + \phi_k), \quad k = 1, \ldots, 4.
$$

(Note that the model is exact when the parameters are chosen as $\theta_1 = c$, $\theta_k = \alpha_{k-1}$,
k = 2,...,5. This rarely occurs in practice.) We fit our model using regularized
least squares on 10 noisy samples of the function, shown as the open circles in
figure 15.11. We will test the model obtained on the 20 noisy data points shown
as the filled circles in figure 15.11.

Figure 15.12 shows the regularization path and the RMS training and test errors
as functions of the regularization parameter $\lambda$, as it varies over a large range. The
regularization path shows that as $\lambda$ increases, the parameters $\theta_2, \ldots, \theta_5$ get smaller
(i.e., shrink), converging towards zero as $\lambda$ gets very large. We can see that the
training prediction error increases with increasing $\lambda$ (since we are trading off model
sensitivity for sum square fitting error). The test error follows a typical pattern: It
first decreases to a minimum value, then increases again. The minimum test error
occurs around $\lambda = 0.079$; any choice between around $\lambda = 0.065$ and 0.100 (say)
would be reasonable. The horizontal dashed lines show the ‘true’ values of the
coefficients (i.e., the ones we used to synthesize the data) given in (15.8). We can
see that for $\lambda$ near 0.079, our estimated parameters are close to the ‘true’ values.
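A regularization path of this kind can be sketched as follows. The signal parameters are the ones given in (15.8) and the text; the sample times, the noise level, the random seed, and the 1% tolerance used to pick the largest near-optimal weight are all assumptions made for illustration, so the numbers produced will not match figure 15.12.

```python
import numpy as np

omega = np.array([13.69, 3.55, 23.25, 6.03])
phi = np.array([0.21, 0.02, -1.87, 1.72])
c, alpha = 1.54, np.array([0.66, -0.90, -0.66, 0.89])
theta_true = np.concatenate([[c], alpha])

def features(t):
    """Columns f_1 = 1 and f_{k+1} = cos(omega_k t + phi_k), k = 1, ..., 4."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([np.ones_like(t), np.cos(np.outer(t, omega) + phi)])

def ridge_fit(A, y, lam):
    """Minimize ||y - A theta||^2 + lam ||theta[1:]||^2 by stacking."""
    p = A.shape[1]
    A_stacked = np.vstack([A, np.sqrt(lam) * np.eye(p)[1:]])
    y_stacked = np.concatenate([y, np.zeros(p - 1)])
    return np.linalg.lstsq(A_stacked, y_stacked, rcond=None)[0]

def rms(v):
    return np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(0)
t_train, t_test = rng.random(10), rng.random(20)     # assumed sample times in [0, 1]
y_train = features(t_train) @ theta_true + 0.1 * rng.standard_normal(10)
y_test = features(t_test) @ theta_true + 0.1 * rng.standard_normal(20)

lambdas = np.logspace(-5, 5, 50)                     # logarithmically spaced weights
path = np.array([ridge_fit(features(t_train), y_train, lam) for lam in lambdas])
train_rms = np.array([rms(features(t_train) @ th - y_train) for th in path])
test_rms = np.array([rms(features(t_test) @ th - y_test) for th in path])

# Among weights whose test error is within 1% of the best, take the largest.
near_best = test_rms <= 1.01 * test_rms.min()
lam_chosen = lambdas[near_best].max()
print(lam_chosen, path[lambdas == lam_chosen][0])
```

Each row of `path` is the parameter vector for one weight, so plotting its columns against `lambdas` on a logarithmic axis gives the regularization path, and plotting `train_rms` and `test_rms` gives the two error curves.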


Linear independence of columns. One side benefit of adding regularization to
the basic least squares data fitting method, as in (15.7), is that the columns of the
associated stacked matrix are always linearly independent, even if the columns of
the matrix A are not. To see this, suppose that

$$
\begin{bmatrix} A \\ \sqrt{\lambda}\, B \end{bmatrix} x = 0,
$$

where B is the (p − 1) × p selector matrix

$$
B = \begin{bmatrix} e_2^T \\ \vdots \\ e_p^T \end{bmatrix},
$$

so $B\theta = \theta_{2:p}$. From the last p − 1 entries in the equation above, we get $\sqrt{\lambda}\, x_i = 0$ for
i = 2,...,p, which implies that $x_2 = \cdots = x_p = 0$. Using these values of $x_2, \ldots, x_p$,
and the fact that the first column of A is $\mathbf{1}$, the top m equations become $\mathbf{1} x_1 = 0$,
and we conclude that $x_1 = 0$ as well. So the columns of the stacked matrix are
always linearly independent.
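This property is easy to confirm numerically. In the sketch below the matrix A is built (hypothetically) with a column of ones and two deliberately redundant columns, so A itself has dependent columns while the stacked matrix does not.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, lam = 8, 5, 0.1

# First column is all ones; columns 4 and 5 are combinations of columns 2 and 3.
A = np.column_stack([np.ones(m), rng.standard_normal((m, 2))])
A = np.column_stack([A, A[:, 1] + A[:, 2], 2 * A[:, 1]])

B = np.eye(p)[1:]                               # rows e_2^T, ..., e_p^T
stacked = np.vstack([A, np.sqrt(lam) * B])

rank_A = np.linalg.matrix_rank(A)               # fewer than p independent columns
rank_stacked = np.linalg.matrix_rank(stacked)   # all p columns independent
print(rank_A, rank_stacked)
```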


Feature engineering and regularized least squares. The simplest method of 
avoiding over-fit is to keep the model simple, which usually means that we should 
not use too many features. A typical and rough rule of thumb is that the number 
of features should be small compared to the number of data points (say, no more 
than 10% or 20%). The presence of over-fit can be detected using out-of-sample 
validation or cross-validation, which is always done when you fit a model to data. 

Regularization is a powerful alternative method to avoid over-fitting a model. 
With regularization, you can fit a model with more features than would be appro- 
priate without regularization. You can even fit a model using more features than 
you have data points, in which case the matrix A is wide. Regularization is often 
the key to success in feature engineering, which can greatly increase the number of 
features. 


15.5 Complexity


In the general case we can minimize the weighted sum objective (15.1) by creating
the stacked matrix and vector $\tilde A$ and $\tilde b$ in (15.2), and then using the QR factorization
to solve the resulting least squares problem. The complexity of this method is order
$mn^2$ flops, where $m = m_1 + \cdots + m_k$ is the sum of heights of the matrices $A_1, \ldots, A_k$.

When using multi-objective least squares, it is common to minimize the weighted
sum objective for some, or even many, different choices of weights. Assuming that
the weighted sum objective is minimized for L different values of the weights, the
total complexity is order $Lmn^2$ flops.

Figure 15.12 Top. RMS training and test errors as a function of the regularization
parameter $\lambda$. Bottom. The regularization path. The dashed
horizontal lines show the values of the coefficients used to generate the data.


15.5.1 Gram caching


We start from the formula (15.3) for the minimizer of the weighted sum objective,

$$
\hat x = (\lambda_1 A_1^T A_1 + \cdots + \lambda_k A_k^T A_k)^{-1} (\lambda_1 A_1^T b_1 + \cdots + \lambda_k A_k^T b_k).
$$

The matrix appearing in the inverse is a weighted sum of the Gram matrices $G_i =
A_i^T A_i$ associated with the matrices $A_i$. We can compute $\hat x$ by forming these Gram
matrices $G_i$, along with the vectors $h_i = A_i^T b_i$, then forming the weighted sums

$$
G = \lambda_1 G_1 + \cdots + \lambda_k G_k, \qquad h = \lambda_1 h_1 + \cdots + \lambda_k h_k,
$$

and finally, solving the n × n set of equations $G\hat x = h$. Forming $G_i$ and $h_i$ costs
$m_i n^2$ and $2 m_i n$ flops, respectively. (We save a factor of two in forming the Gram
matrix; see page 182.) Ignoring the second term and adding over i = 1,...,k we
get a total of $mn^2$ flops. Forming the weighted sums G and h costs $2kn^2$ flops.
Solving $G\hat x = h$ costs order $2n^3$ flops.

Gram caching is the simple trick of computing $G_i$ (and $h_i$) just once, and re-using
these matrices and vectors for the L different choices of weights. This leads
to a complexity of

$$
mn^2 + L(k + 2n)n^2
$$

flops. When m is much larger than k + n, which is a common occurrence, this cost
is smaller than $Lmn^2$, the cost for the simple method.
As a simple example consider Tikhonov regularization. We will compute

$$
\hat x^{(i)} = (A^T A + \lambda^{(i)} I)^{-1} A^T b
$$

for i = 1,...,L, where A is m × n. The cost of the simple method is $2Lmn^2$ flops;
using Gram caching the cost is $mn^2 + 2Ln^3 = (m + 2Ln)n^2$ flops. (We drop the
term $Lkn^2$, since k = 2 here.) With m = 100n and L = 100, Gram caching reduces
the computational cost by more than a factor of 50. This means that the entire
regularization path (i.e., the solution for 100 values of $\lambda$) can be computed in not
much more time than it takes to compute the solution for one value of $\lambda$.
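Gram caching for Tikhonov regularization can be sketched in a few lines. The sizes, the random data, and the function name `tikhonov_stacked` are assumptions for illustration; the cached version solves the n × n equations directly with `np.linalg.solve`, and the comparison version re-solves a stacked least squares problem for every weight.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, L = 2000, 20, 100
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lambdas = np.logspace(-3, 3, L)

# Cache the Gram matrix and A^T b once; this is the only step that involves m.
G = A.T @ A
h = A.T @ b
# Each weight then needs only the solution of n x n equations.
X_cached = np.array([np.linalg.solve(G + lam * np.eye(n), h) for lam in lambdas])

def tikhonov_stacked(A, b, lam):
    """Simple method: a fresh stacked least squares problem for each weight."""
    n = A.shape[1]
    A_stacked = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_stacked = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_stacked, b_stacked, rcond=None)[0]

X_simple = np.array([tikhonov_stacked(A, b, lam) for lam in lambdas])
max_diff = np.max(np.abs(X_cached - X_simple))
print(max_diff)
```

For the case m = 100n and L = 100 discussed above, the two flop counts are $2Lmn^2 = 20000n^3$ and $(m + 2Ln)n^2 = 300n^3$, a ratio of about 67, consistent with the stated factor of more than 50.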


15.5.2 The kernel trick


In this section we focus on another special case, which arises in many applications:

$$
J = \|Ax - b\|^2 + \lambda \|x - x^{\mathrm{des}}\|^2, \tag{15.9}
$$

where the m × n matrix A is wide, i.e., m < n, and $\lambda > 0$. (Here we drop the
subscripts on A, b, and m since we have only one matrix in this problem.) The
associated (m + n) × n stacked matrix (see (15.2))

$$
\tilde A = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix}
$$

always has linearly independent columns. Using the QR factorization to solve the
stacked least squares problem requires $2(m + n)n^2$ flops, which grows like $n^3$. We
will show now how this special problem can be solved far more efficiently when m
is much smaller than n, using something called the kernel trick. Recall that the
minimizer of J is given by (see (15.3))

$$
\begin{aligned}
\hat x &= (A^T A + \lambda I)^{-1} (A^T b + \lambda x^{\mathrm{des}}) \\
&= (A^T A + \lambda I)^{-1} \left( A^T b + (\lambda I + A^T A) x^{\mathrm{des}} - (A^T A) x^{\mathrm{des}} \right) \\
&= (A^T A + \lambda I)^{-1} A^T (b - A x^{\mathrm{des}}) + x^{\mathrm{des}}.
\end{aligned}
$$

The matrix inverse here has size n × n.
We will use the identity

$$
(A^T A + \lambda I)^{-1} A^T = A^T (A A^T + \lambda I)^{-1} \tag{15.10}
$$

which holds for any matrix A and any $\lambda > 0$. Note that the left-hand side of the
identity involves the inverse of an n × n matrix, whereas the right-hand side involves
the inverse of a (smaller) m × m matrix. (This is a variation on the push-through
identity from exercise 11.9.)

To show the identity (15.10), we first observe that the matrices $A^T A + \lambda I$ and
$A A^T + \lambda I$ are invertible. We start with the equation

$$
A^T (A A^T + \lambda I) = (A^T A + \lambda I) A^T,
$$

and multiply each side by $(A^T A + \lambda I)^{-1}$ on the left and $(A A^T + \lambda I)^{-1}$ on the right,
which yields the identity above.
Using (15.10) we can express the minimizer of J as

$$
\hat x = A^T (A A^T + \lambda I)^{-1} (b - A x^{\mathrm{des}}) + x^{\mathrm{des}}.
$$

We can compute the term $(A A^T + \lambda I)^{-1}(b - A x^{\mathrm{des}})$ by computing the QR factorization
of the (m + n) × m matrix

$$
\bar A = \begin{bmatrix} A^T \\ \sqrt{\lambda}\, I \end{bmatrix},
$$

which has a cost of $2(m + n)m^2$ flops. The other operations involve matrix-vector
products and have order (at most) mn flops, so we can use this method to compute
$\hat x$ in around $2(m + n)m^2$ flops. This complexity grows only linearly in n.
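Both the identity (15.10) and the resulting formula for the minimizer can be checked numerically on a wide matrix. The sketch below uses random data and solves the m × m equations with `np.linalg.solve` rather than the QR factorization described in the text, which is a simplification made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 15, 400, 0.5                       # A is wide: m is much smaller than n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_des = rng.standard_normal(n)

# Direct formula: requires solving an n x n set of equations.
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b + lam * x_des)

# Kernel trick: only an m x m set of equations is solved.
z = np.linalg.solve(A @ A.T + lam * np.eye(m), b - A @ x_des)
x_kernel = A.T @ z + x_des

# The identity (15.10) itself, checked as an equality of n x m matrices.
lhs = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T)
rhs = A.T @ np.linalg.inv(A @ A.T + lam * np.eye(m))

print(np.max(np.abs(x_direct - x_kernel)), np.max(np.abs(lhs - rhs)))
```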

To summarize, we can minimize the regularized least squares objective J in (15.9)
two different ways. One requires a QR factorization of the (m + n) × n matrix $\tilde A$,
which has cost $2(m + n)n^2$ flops. The other (which uses the kernel trick) requires
a QR factorization of the (m + n) × m matrix $\bar A$, which has cost $2(m + n)m^2$ flops.
We should evidently use the kernel trick when m < n. The complexity can then
be expressed as

$$
(m + n) \min\{m^2, n^2\} \approx \min\{mn^2, nm^2\} = (\max\{m, n\})(\min\{m, n\})^2,
$$

where $\approx$ means we ignore non-dominant terms.
This is an instance of the big-times-small-squared rule or mnemonic, which
states that many operations involving a matrix A can be carried out with order

$$
(\text{big}) \times (\text{small})^2 \ \text{flops},
$$

where ‘big’ and ‘small’ refer to the big and small dimensions of the matrix. Several
other examples are listed in appendix B.


Exercises


15.1 A scalar multi-objective least squares problem. We consider the special case of the
multi-objective least squares problem in which the variable x is a scalar, and the k matrices $A_i$
are all 1 × 1 matrices with value $A_i = 1$, so $J_i = (x - b_i)^2$. In this case our goal is to choose
a number x that is simultaneously close to all the numbers $b_1, \ldots, b_k$. Let $\lambda_1, \ldots, \lambda_k$ be
positive weights, and $\hat x$ the minimizer of the weighted objective (15.1). Show that $\hat x$ is a
weighted average (or convex combination; see page 17) of the numbers $b_1, \ldots, b_k$, i.e., it
has the form

$$
\hat x = w_1 b_1 + \cdots + w_k b_k,
$$

where $w_i$ are nonnegative and sum to one. Give an explicit formula for the combination
weights $w_i$ in terms of the multi-objective least squares weights $\lambda_i$.

15.2 Consider the regularized data fitting problem (15.7). Recall that the elements in the first
column of A are one. Let $\hat\theta$ be the solution of (15.7), i.e., the minimizer of

$$
\|A\theta - y\|^2 + \lambda(\theta_2^2 + \cdots + \theta_p^2),
$$

and let $\tilde\theta$ be the minimizer of

$$
\|A\theta - y\|^2 + \lambda\|\theta\|^2 = \|A\theta - y\|^2 + \lambda(\theta_1^2 + \theta_2^2 + \cdots + \theta_p^2),
$$

in which we also penalize $\theta_1$. Suppose columns 2 through p of A have mean zero (for
example, because features 2,...,p have been standardized on the data set; see page 269).
Show that $\hat\theta_k = \tilde\theta_k$ for k = 2,...,p.


15.3 Weighted Gram matrix. Consider a multi-objective least squares problem with matrices
$A_1, \ldots, A_k$ and positive weights $\lambda_1, \ldots, \lambda_k$. The matrix

$$
G = \lambda_1 A_1^T A_1 + \cdots + \lambda_k A_k^T A_k
$$

is called the weighted Gram matrix; it is the Gram matrix of the stacked matrix $\tilde A$ (given
in (15.2)) associated with the multi-objective problem. Show that G is invertible provided
there is no nonzero vector x that satisfies $A_1 x = 0, \ldots, A_k x = 0$.


15.4 Robust approximate solution of linear equations. We wish to solve the square set of n
linear equations Ax = b for the n-vector x. If A is invertible the solution is $x = A^{-1}b$.
In this exercise we address an issue that comes up frequently: We don’t know A exactly.
One simple method is to just choose a typical value of A and use it. Another method,
which we explore here, takes into account the variation in the matrix A. We find a set of
K versions of A, and denote them as $A^{(1)}, \ldots, A^{(K)}$. (These could be found by measuring
the matrix A at different times, for example.) Then we choose x so as to minimize

$$
\|A^{(1)} x - b\|^2 + \cdots + \|A^{(K)} x - b\|^2,
$$

the sum of the squares of residuals obtained with the K versions of A. This choice of x,
which we denote $x^{\mathrm{rob}}$, is called a robust (approximate) solution. Give a formula for $x^{\mathrm{rob}}$, in
terms of $A^{(1)}, \ldots, A^{(K)}$ and b. (You can assume that a matrix you construct has linearly
independent columns.) Verify that for K = 1 your formula reduces to $x^{\mathrm{rob}} = (A^{(1)})^{-1} b$.

15.5 Some properties of bi-objective least squares. Consider the bi-objective least squares
problem with objectives

$$
J_1(x) = \|A_1 x - b_1\|^2, \qquad J_2(x) = \|A_2 x - b_2\|^2.
$$

For $\lambda > 0$, let $\hat x(\lambda)$ denote the minimizer of $J_1(x) + \lambda J_2(x)$. (We assume the columns of
the stacked matrix are linearly independent.) We define $J_1^\star(\lambda) = J_1(\hat x(\lambda))$ and $J_2^\star(\lambda) =
J_2(\hat x(\lambda))$, the values of the two objectives as functions of the weight parameter. The
optimal trade-off curve is the set of points $(J_1^\star(\lambda), J_2^\star(\lambda))$, as $\lambda$ varies over all positive
numbers.

(a) Bi-objective problems without trade-off. Suppose $\mu$ and $\gamma$ are different positive
weights, and $\hat x(\mu) = \hat x(\gamma)$. Show that $\hat x(\lambda)$ is constant for all $\lambda > 0$. Therefore
the point $(J_1^\star(\lambda), J_2^\star(\lambda))$ is the same for all $\lambda$ and the trade-off curve collapses to a
single point.


(b) Effect of weight on objectives in a bi-objective problem. Suppose $\hat x(\lambda)$ is not constant.
Show the following: for $\lambda < \mu$, we have

$$
J_1^\star(\lambda) < J_1^\star(\mu), \qquad J_2^\star(\lambda) > J_2^\star(\mu).
$$

This means that if you increase the weight (on the second objective), the second
objective goes down, and the first objective goes up. In other words the trade-off
curve slopes downward.

Hint. Resist the urge to write out any equations or formulas. Use the fact that
$\hat x(\lambda)$ is the unique minimizer of $J_1(x) + \lambda J_2(x)$, and similarly for $\hat x(\mu)$, to deduce
the inequalities

$$
J_1^\star(\mu) + \lambda J_2^\star(\mu) > J_1^\star(\lambda) + \lambda J_2^\star(\lambda), \qquad
J_1^\star(\lambda) + \mu J_2^\star(\lambda) > J_1^\star(\mu) + \mu J_2^\star(\mu).
$$

Combine these inequalities to show that $J_1^\star(\lambda) < J_1^\star(\mu)$ and $J_2^\star(\lambda) > J_2^\star(\mu)$.
(c) Slope of the trade-off curve. The slope of the trade-off curve at the point $(J_1^\star(\lambda), J_2^\star(\lambda))$
is given by

$$
S = \lim_{\mu \to \lambda} \frac{J_2^\star(\mu) - J_2^\star(\lambda)}{J_1^\star(\mu) - J_1^\star(\lambda)}.
$$

(This limit is the same if $\mu$ approaches $\lambda$ from below or from above.) Show that
$S = -1/\lambda$. This gives another interpretation of the parameter $\lambda$: $(J_1^\star(\lambda), J_2^\star(\lambda))$ is
the point on the trade-off curve where the curve has slope $-1/\lambda$.

Hint. First assume that $\mu$ approaches $\lambda$ from above (meaning, $\mu > \lambda$) and use the
inequalities in the hint for part (b) to show that $S \geq -1/\lambda$. Then assume that $\mu$
approaches $\lambda$ from below and show that $S \leq -1/\lambda$.


15.6 Least squares with smoothness regularization. Consider the weighted sum least squares
objective

$$
\|Ax - b\|^2 + \lambda \|Dx\|^2,
$$

where the n-vector x is the variable, A is an m × n matrix, D is the (n − 1) × n difference
matrix, with ith row $(e_{i+1} - e_i)^T$, and $\lambda > 0$. Although it does not matter in this problem,
this objective is what we would minimize if we want an x that satisfies $Ax \approx b$, and has
entries that are smoothly varying. We can express this objective as a standard least
squares objective with a stacked matrix of size (m + n − 1) × n.

Show that the stacked matrix has linearly independent columns if and only if $A\mathbf{1} \neq 0$,
i.e., the sum of the columns of A is not zero.


15.7 Greedy regulation policy. Consider a linear dynamical system given by $x_{t+1} = Ax_t + Bu_t$,
where the n-vector $x_t$ is the state at time t, and the m-vector $u_t$ is the input at time t. The
goal in regulation is to choose the input so as to make the state small. (In applications,
the state $x_t = 0$ corresponds to the desired operating point, so small $x_t$ means the state
is close to the desired operating point.) One way to achieve this goal is to choose $u_t$ so
as to minimize

$$
\|x_{t+1}\|^2 + \rho \|u_t\|^2,
$$

where $\rho$ is a (given) positive parameter that trades off using a small input versus making
the (next) state small. Show that choosing $u_t$ this way leads to a state feedback policy
$u_t = Kx_t$, where K is an m × n matrix. Give a formula for K (in terms of A, B, and $\rho$).
If an inverse appears in your formula, state the conditions under which the inverse exists.

Remark. This policy is called greedy or myopic since it does not take into account the
effect of the input $u_t$ on future states, beyond $x_{t+1}$. It can work very poorly in practice.


15.8 Estimating the elasticity matrix. In this problem you create a standard model of how
demand varies with the prices of a set of products, based on some observed data. There
are n different products, with (positive) prices given by the n-vector p. The prices are
held constant over some period, say, a day. The (positive) demands for the products over
the day are given by the n-vector d. The demand in any particular day varies, but it is
thought to be (approximately) a function of the prices.

The nominal prices are given by the n-vector $p^{\mathrm{nom}}$. You can think of these as the prices
that have been charged in the past for the products. The nominal demand is the n-vector
$d^{\mathrm{nom}}$. This is the average value of the demand, when the prices are set to $p^{\mathrm{nom}}$. (The
actual daily demand fluctuates around the value $d^{\mathrm{nom}}$.) You know both $p^{\mathrm{nom}}$ and $d^{\mathrm{nom}}$.
We will describe the prices by their (fractional) variations from the nominal values, and
the same for demands. We define $\delta^p$ and $\delta^d$ as the (vectors of) relative price change and
demand change:

$$
\delta_i^p = \frac{p_i - p_i^{\mathrm{nom}}}{p_i^{\mathrm{nom}}}, \qquad
\delta_i^d = \frac{d_i - d_i^{\mathrm{nom}}}{d_i^{\mathrm{nom}}}, \qquad i = 1, \ldots, n.
$$

So $\delta_3^p = +0.05$ means that the price for product 3 has been increased by 5% over its
nominal value, and $\delta_5^d = -0.04$ means that the demand for product 5 in some day is 4%
below its nominal value.

Your task is to build a model of the demand as a function of the price, of the form

$$
\delta^d \approx E \delta^p,
$$

where E is the n × n elasticity matrix. You don’t know E, but you do have the results
of some experiments in which the prices were changed a bit from their nominal values for
one day, and the day’s demands were recorded. This data has the form

$$
(p_1, d_1), \ldots, (p_N, d_N),
$$

where $p_i$ is the price for day i, and $d_i$ is the observed demand.

Explain how you would estimate E, given this price-demand data. Be sure to explain how
you will test for, and (if needed) avoid over-fit. Hint. Formulate the problem as a matrix
least squares problem; see page 233.

Remark. Note the difference between elasticity estimation and demand shaping, discussed
on page 315. In demand shaping, we know the elasticity matrix and are choosing prices;
in elasticity estimation, we are guessing the elasticity matrix from some observed price
and demand data.


15.9 Regularizing stratified models. In a stratified model (see page 272), we divide the data
into different sets, depending on the value of some (often Boolean) feature, and then fit
a separate model for each of these two data sets, using the remaining features. As an
example, to develop a model of some health outcome we might build a separate model for
women and for men. In some cases better models are obtained when we encourage the
different models in a stratified model to be close to each other. For the case of stratifying
on one Boolean feature, this is done by choosing the two model parameters $\theta^{(1)}$ and $\theta^{(2)}$
to minimize

$$
\|A^{(1)} \theta^{(1)} - y^{(1)}\|^2 + \|A^{(2)} \theta^{(2)} - y^{(2)}\|^2 + \lambda \|\theta^{(1)} - \theta^{(2)}\|^2,
$$

where $\lambda \geq 0$ is a parameter. The first term is the least squares residual for the first model
on the first data set (say, women); the second term is the least squares residual for the
second model on the second data set (say, men); the third term is a regularization term
that encourages the two model parameters to be close to each other. Note that when
$\lambda = 0$, we simply fit each model separately; when $\lambda$ is very large, we are basically fitting
one model to all the data. Of course the choice of an appropriate value of $\lambda$ is obtained
using out-of-sample validation (or cross-validation).

(a) Give a formula for the optimal $(\hat\theta^{(1)}, \hat\theta^{(2)})$. (If your formula requires one or more
matrices to have linearly independent columns, say so.)

(b) Stratifying across age groups. Suppose we fit a model with each data point representing
a person, and we stratify over the person’s age group, which is a range of
consecutive ages such as 18-24, 24-32, 33-45, and so on. Our goal is to fit a model
for each of k age groups, with the parameters for adjacent age groups similar, or not
too far, from each other. Suggest a method for doing this.


15.10 Estimating a periodic time series. (See §15.3.2.) Suppose that the T-vector y is a measured
time series, and we wish to approximate it with a P-periodic T-vector. For simplicity,
we assume that T = KP, where K is an integer. Let $\hat y$ be the simple least squares fit,
with no regularization, i.e., the P-periodic vector that minimizes $\|\hat y - y\|^2$. Show that for
i = 1,..., P − 1, we have

$$
\hat y_i = \frac{1}{K} \sum_{k=1}^{K} y_{i+(k-1)P}.
$$

In other words, each entry of the periodic estimate is the average of the entries of the
original vector over the corresponding indices.


General pseudo-inverse. In chapter 11 we encountered the pseudo-inverse of a tall matrix 
with linearly independent columns, a wide matrix with linearly independent rows, and 
a square invertible matrix. In this exercise we describe the pseudo-inverse of a general 
matrix, i.e., one that does not fit these categories. The general pseudo-inverse can be 
defined in terms of Tikhonov regularized inversion (see page 317). Let A be any matrix, 
and A > 0. The Tikhonov regularized approximate solution of Ar = b, i.e., unique 
minimizer of || Aa — b||? + Alja||?, is given by (ATA + AI)~1A7b. The pseudo-inverse of A 
is defined as 
A’ = lim(A7A+AI)'A™. 
A->0 
In other words, Afb is the limit of the Tikhonov-regularized approximate solution of 
Az = b, as the regularization parameter converges to zero. (It can be shown that this 
limit always exists.) Using the kernel trick identity (15.10), we can also express the 
pseudo-inverse as 
At = lim A7(AA™ +A). 
A->0 
(a) What is the pseudo-inverse of the m x n zero matrix? 


(b) Suppose A has linearly independent columns. Explain why the limits above reduce 
to our previous definition, AT = (AT A)~"A’. 

(c) Suppose A has linearly independent rows. Explain why the limits above reduce to 
our previous definition, A? = A7(AA7)~?. 


Hint. For parts (b) and (c), you can use the fact that the matrix inverse is a continuous 
function, which means that the limit of the inverse of a matrix is the inverse of the limit, 
provided the limit matrix is invertible. 
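The limit definition can be explored numerically. The following is a minimal sketch, assuming a small rank-deficient matrix chosen only for illustration, and using NumPy's `np.linalg.pinv` (which computes the Moore-Penrose pseudo-inverse from the SVD) as the reference. The helper name `tikhonov_term` is hypothetical.

```python
import numpy as np

def tikhonov_term(A, lam):
    """Evaluate (A^T A + lam I)^{-1} A^T for a given lam > 0."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T)

# A square matrix of rank 2 (row 2 is twice row 1), so none of the
# chapter 11 definitions of the pseudo-inverse applies.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])
A_pinv = np.linalg.pinv(A)

errs = []
for lam in (1e-2, 1e-4, 1e-6):
    errs.append(np.linalg.norm(tikhonov_term(A, lam) - A_pinv))
print(errs)  # the differences shrink as lam decreases
```

For very small values of lam the matrix $A^TA + \lambda I$ becomes badly conditioned, so this is a way to build intuition, not a recommended way to compute a pseudo-inverse.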


Chapter 16


Constrained least squares


In this chapter we discuss a useful extension of the least squares problem that
includes linear equality constraints. Like least squares, the constrained least squares
problem can be reduced to a set of linear equations, which can be solved using the
QR factorization.


16.1 Constrained least squares problem


In the basic least squares problem, we seek x that minimizes the objective function
$\|Ax - b\|^2$. We now add constraints to this problem, by insisting that x satisfy
the linear equations Cx = d, where the matrix C and the vector d are given.
The linearly constrained least squares problem (or just constrained least squares
problem) is written as

\[
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|^2 \\
\mbox{subject to} & Cx = d.
\end{array}
\qquad (16.1)
\]

Here x, the variable to be found, is an n-vector. The problem data (which are
given) are the $m \times n$ matrix A, the m-vector b, the $p \times n$ matrix C, and the
p-vector d.

We refer to the function $\|Ax - b\|^2$ as the objective of the problem, and the set
of p linear equality constraints Cx = d as the constraints of the problem. They
can be written out as p scalar constraints (equations)

\[ c_i^T x = d_i, \quad i = 1, \ldots, p, \]

where $c_i^T$ is the ith row of C.

An n-vector x is called feasible (for the problem (16.1)) if it satisfies the
constraints, i.e., Cx = d. An n-vector $\hat x$ is called an optimal point or solution of the
optimization problem (16.1) if it is feasible, and if $\|A\hat x - b\|^2 \le \|Ax - b\|^2$ holds for
any feasible x. In other words, $\hat x$ solves the problem (16.1) if it is feasible and has
the smallest possible value of the objective function among all feasible vectors.

The constrained least squares problem combines the problems of solving a set
of linear equations (find x that satisfies Cx = d) with the least squares problem
(find x that minimizes $\|Ax - b\|^2$). Indeed each of these problems can be considered
a special case of the constrained least squares problem (16.1).

Figure 16.1 Least squares fit of two cubic polynomials to 140 points, with
continuity constraints p(a) = q(a) and p'(a) = q'(a).

The constrained least squares problem can also be thought of as a limit of a
bi-objective least squares problem, with primary objective $\|Ax - b\|^2$ and secondary
objective $\|Cx - d\|^2$. Roughly speaking, we put infinite weight on the second
objective, so that any nonzero value is unacceptable (which forces x to satisfy
Cx = d). So we would expect (and it can be verified) that minimizing the weighted
objective

\[ \|Ax - b\|^2 + \lambda \|Cx - d\|^2, \]

for a very large value of $\lambda$ yields a vector close to a solution of the constrained least
squares problem (16.1). We will encounter this idea again in chapter 19, when we
consider the nonlinear constrained least squares problem.
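The weighted-objective idea can be tried on random data. The sketch below is an illustration only: the problem sizes, the random data, and the helper names `weighted_ls` and `cls_reference` are assumptions. The reference solution is obtained from the KKT equations (16.4) derived in §16.2.

```python
import numpy as np

def weighted_ls(A, b, C, d, lam):
    """Minimize ||Ax-b||^2 + lam*||Cx-d||^2 as one stacked least squares problem."""
    A_stack = np.vstack([A, np.sqrt(lam) * C])
    b_stack = np.concatenate([b, np.sqrt(lam) * d])
    return np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]

def cls_reference(A, b, C, d):
    """Constrained least squares solution from the KKT equations (16.4)."""
    n, p = A.shape[1], C.shape[0]
    KKT = np.block([[2 * A.T @ A, C.T], [C, np.zeros((p, p))]])
    rhs = np.concatenate([2 * A.T @ b, d])
    return np.linalg.solve(KKT, rhs)[:n]

rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
C, d = rng.standard_normal((2, 5)), rng.standard_normal(2)

x_cls = cls_reference(A, b, C, d)
gaps = [np.linalg.norm(weighted_ls(A, b, C, d, lam) - x_cls)
        for lam in (1.0, 1e2, 1e4, 1e6)]
print(gaps)  # distance to the constrained solution for increasing lam
```

Very large weights make the stacked problem poorly scaled, which is one reason the chapter solves the constrained problem directly.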


Example. In figure 16.1 we fit a piecewise-polynomial function f(x) to a set of
N = 140 points $(x_i, y_i)$ in the plane. The function f(x) is defined as

\[
f(x) = \left\{ \begin{array}{ll} p(x) & x \le a \\ q(x) & x > a, \end{array} \right.
\]

with a given, and p(x) and q(x) polynomials of degree three or less,

\[
p(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3, \qquad
q(x) = \theta_5 + \theta_6 x + \theta_7 x^2 + \theta_8 x^3.
\]

We also impose the condition that p(a) = q(a) and p'(a) = q'(a), so that f(x) is
continuous and has a continuous first derivative at x = a. Suppose the N data
points $(x_i, y_i)$ are numbered so that $x_1, \ldots, x_M \le a$ and $x_{M+1}, \ldots, x_N > a$. The
sum of squares of the prediction errors is


\[
\sum_{i=1}^{M} (\theta_1 + \theta_2 x_i + \theta_3 x_i^2 + \theta_4 x_i^3 - y_i)^2
+ \sum_{i=M+1}^{N} (\theta_5 + \theta_6 x_i + \theta_7 x_i^2 + \theta_8 x_i^3 - y_i)^2.
\]


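The conditions $p(a) - q(a) = 0$ and $p'(a) - q'(a) = 0$ are two linear equations

\[
\begin{array}{rcl}
\theta_1 + \theta_2 a + \theta_3 a^2 + \theta_4 a^3 - \theta_5 - \theta_6 a - \theta_7 a^2 - \theta_8 a^3 & = & 0 \\
\theta_2 + 2\theta_3 a + 3\theta_4 a^2 - \theta_6 - 2\theta_7 a - 3\theta_8 a^2 & = & 0.
\end{array}
\]

We can determine the coefficients $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_8)$ that minimize the sum of squares
of the prediction errors, subject to the continuity constraints, by solving a
constrained least squares problem

\[
\begin{array}{ll}
\mbox{minimize} & \|A\theta - b\|^2 \\
\mbox{subject to} & C\theta = d.
\end{array}
\]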


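The matrices and vectors A, b, C, d are defined as

\[
A = \left[ \begin{array}{cccccccc}
1 & x_1 & x_1^2 & x_1^3 & 0 & 0 & 0 & 0 \\
1 & x_2 & x_2^2 & x_2^3 & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_M & x_M^2 & x_M^3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & x_{M+1} & x_{M+1}^2 & x_{M+1}^3 \\
0 & 0 & 0 & 0 & 1 & x_{M+2} & x_{M+2}^2 & x_{M+2}^3 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 1 & x_N & x_N^2 & x_N^3
\end{array} \right], \qquad
b = \left[ \begin{array}{c}
y_1 \\ y_2 \\ \vdots \\ y_M \\ y_{M+1} \\ y_{M+2} \\ \vdots \\ y_N
\end{array} \right],
\]

and

\[
C = \left[ \begin{array}{cccccccc}
1 & a & a^2 & a^3 & -1 & -a & -a^2 & -a^3 \\
0 & 1 & 2a & 3a^2 & 0 & -1 & -2a & -3a^2
\end{array} \right], \qquad
d = \left[ \begin{array}{c} 0 \\ 0 \end{array} \right].
\]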

This method is easily extended to piecewise-polynomial functions with more than 
two intervals. Functions of this kind are called splines. 
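The construction in this example can be sketched in a few lines. The data below are synthetic (the 140 points of figure 16.1 are not available here), the knot a = 0 is an arbitrary choice, and the function name `fit_piecewise_cubic` is hypothetical. The constrained problem is solved through the KKT equations (16.4) of §16.2.

```python
import numpy as np
from numpy.polynomial import polynomial as Poly

def fit_piecewise_cubic(x, y, a):
    """Fit cubics p (for x <= a) and q (for x > a) with p(a)=q(a), p'(a)=q'(a)."""
    left = x <= a
    xl, yl = x[left], y[left]
    xr, yr = x[~left], y[~left]
    M, N = len(xl), len(x)
    A = np.zeros((N, 8))
    A[:M, :4] = np.vander(xl, 4, increasing=True)   # rows [1, x, x^2, x^3, 0, 0, 0, 0]
    A[M:, 4:] = np.vander(xr, 4, increasing=True)   # rows [0, 0, 0, 0, 1, x, x^2, x^3]
    b = np.concatenate([yl, yr])
    C = np.array([[1, a, a**2, a**3, -1, -a, -a**2, -a**3],
                  [0, 1, 2*a, 3*a**2, 0, -1, -2*a, -3*a**2]], dtype=float)
    d = np.zeros(2)
    KKT = np.block([[2 * A.T @ A, C.T], [C, np.zeros((2, 2))]])
    rhs = np.concatenate([2 * A.T @ b, d])
    return np.linalg.solve(KKT, rhs)[:8]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 140)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(140)   # synthetic data, an assumption
a = 0.0
theta = fit_piecewise_cubic(x, y, a)
p, q = theta[:4], theta[4:]

# Both quantities should be (numerically) zero at the knot.
jump = Poly.polyval(a, p) - Poly.polyval(a, q)
slope_jump = Poly.polyval(a, Poly.polyder(p)) - Poly.polyval(a, Poly.polyder(q))
print(jump, slope_jump)
```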


Advertising budget allocation. We continue the example described on page 234, 
where the goal is to purchase advertising in n different channels so as to achieve 
(or approximately achieve) a target set of customer views or impressions in m 
different demographic groups. We denote the n-vector of channel spending as s; 
this spending results in a set of views (across the demographic groups) given by the 
m-vector Rs. We will minimize the sum of squares of the deviation from the target
set of views, given by $v^{\mathrm{des}}$. In addition, we fix our total advertising spending, with
the constraint $\mathbf{1}^T s = B$, where B is a given total advertising budget. (This can
also be described as allocating a total budget B across the n different channels.)
This leads to the constrained least squares problem

\[
\begin{array}{ll}
\mbox{minimize} & \|Rs - v^{\mathrm{des}}\|^2 \\
\mbox{subject to} & \mathbf{1}^T s = B.
\end{array}
\]
(The solution $\hat s$ of this problem is not guaranteed to have nonnegative entries, as it
must to make sense in this application. But we ignore this aspect of the problem
here.)

Figure 16.2 Advertising with budget constraint. The 'optimal' views vector
is the solution of the constrained least squares problem with budget
constraint. The 'scaled' views vector is obtained by scaling the unconstrained
least squares solution so that it satisfies the budget constraint. This is a
scalar multiple of the views vector of figure 12.4.

We consider the same problem instance as on page 234, with m = 10 demographic
groups and n = 3 channels, and reach matrix R given there. The least
squares method yields an RMS error of 133 (around 13.3%), with a total budget
of $\mathbf{1}^T s^{\mathrm{ls}} = 1605$. We seek a spending plan with a budget that is 20% smaller,
B = 1284. Solving the associated constrained least squares problem yields the
spending vector $s^{\mathrm{cls}} = (315, 110, 859)$, which has RMS error of 161 in the target
views. We can compare this spending vector to the one obtained by simply scaling
the least squares spending vector by 0.80. The RMS error for this allocation is 239.
The resulting impressions for both spending plans are shown in figure 16.2.


Least norm problem 


An important special case of the constrained least squares problem (16.1) is when
A = I and b = 0:

\[
\begin{array}{ll}
\mbox{minimize} & \|x\|^2 \\
\mbox{subject to} & Cx = d.
\end{array}
\qquad (16.2)
\]


In this problem we seek the vector of smallest or least norm that satisfies the linear 
equations Cx = d. For this reason the problem (16.2) is called the least norm 
problem or minimum-norm problem. 


Figure 16.3 Left: A force sequence $f^{\mathrm{bb}} = (1, -1, 0, \ldots, 0)$ that transfers the
mass over a unit distance in 10 seconds. Right: The resulting position of
the mass p(t). [Plots: force versus time (left) and position versus time (right).]


Example. The 10-vector f represents a series of forces applied, each for one
second, to a unit mass on a surface with no friction. The mass starts with zero velocity
and position. By Newton's laws, its final velocity and position are given by

\[
\begin{array}{rcl}
v^{\mathrm{fin}} & = & f_1 + f_2 + \cdots + f_{10} \\
p^{\mathrm{fin}} & = & (19/2) f_1 + (17/2) f_2 + \cdots + (1/2) f_{10}.
\end{array}
\]


(See exercise 2.3.) 

Now suppose we want to choose a force sequence that results in $v^{\mathrm{fin}} = 0$, $p^{\mathrm{fin}} = 1$,
i.e., a force sequence that moves the mass to a resting position one meter to the
right. There are many such force sequences; for example $f^{\mathrm{bb}} = (1, -1, 0, \ldots, 0)$.
This force sequence accelerates the mass to velocity 1 (and position 0.5) after one
second, then decelerates it over the next second, so it arrives after two seconds with
velocity 0, at the destination position 1. After that it applies zero force, so the mass stays
where it is, at rest at position 1. The superscript 'bb' refers to bang-bang, which
means that a large force is applied to get the mass moving (the first 'bang') and
another large force (the second 'bang') is then applied to slow it to zero velocity.
The force and position versus time for this choice of f are shown in figure 16.3.

Now we ask, what is the smallest force sequence that can achieve $v^{\mathrm{fin}} = 0$,
$p^{\mathrm{fin}} = 1$, where smallest is measured by the sum of squares of the applied forces,
$\|f\|^2 = f_1^2 + \cdots + f_{10}^2$? This problem can be posed as a least norm problem,

\[
\begin{array}{ll}
\mbox{minimize} & \|f\|^2 \\
\mbox{subject to} &
\left[ \begin{array}{ccccc}
1 & 1 & \cdots & 1 & 1 \\
19/2 & 17/2 & \cdots & 3/2 & 1/2
\end{array} \right] f =
\left[ \begin{array}{c} 0 \\ 1 \end{array} \right],
\end{array}
\]

with variable f. The solution $f^{\mathrm{ln}}$, and the resulting position, are shown in
figure 16.4. The norm square of the least norm solution $f^{\mathrm{ln}}$ is 0.0121; in contrast,
the norm square of the bang-bang force sequence is 2, a factor of 165 times larger.
(Note the very different vertical axis scales in figures 16.4 and 16.3.)
Figure 16.4 Left: The smallest force sequence $f^{\mathrm{ln}}$ that transfers the mass
over a unit distance in 10 steps. Right: The resulting position of the mass p(t).
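The numbers quoted in this example can be reproduced with a short computation. This is a sketch that uses the closed-form least norm solution $C^T(CC^T)^{-1}d$ given as (16.12) in §16.3; the variable names are choices made here, not the book's.

```python
import numpy as np

# Constraint matrix: first row gives the final velocity, second the final position.
C = np.vstack([np.ones(10), np.arange(19, 0, -2) / 2])   # (19/2, 17/2, ..., 1/2)
d = np.array([0.0, 1.0])                                  # v_fin = 0, p_fin = 1

# Least norm solution via the formula (16.12).
f_ln = C.T @ np.linalg.solve(C @ C.T, d)

# Bang-bang force sequence from the text.
f_bb = np.zeros(10)
f_bb[:2] = [1.0, -1.0]

norm_sq_ln = f_ln @ f_ln
norm_sq_bb = f_bb @ f_bb
print(norm_sq_ln, norm_sq_bb, norm_sq_bb / norm_sq_ln)
```

Both sequences satisfy the two constraints; only their sizes differ.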


16.2 Solution 


Optimality conditions via Lagrange multipliers. We will use the method of La- 
grange multipliers (developed by the mathematician Joseph-Louis Lagrange, and 
summarized in §C.3) to solve the constrained least squares problem (16.1). Later 
we give an independent verification, that does not rely on calculus or Lagrange 
multipliers, that the solution we derive is correct. 

We first write the constrained least squares problem with the constraints given 
as a list of p scalar equality constraints: 


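\[
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|^2 \\
\mbox{subject to} & c_i^T x = d_i, \quad i = 1, \ldots, p,
\end{array}
\]

where $c_i^T$ are the rows of C. We form the Lagrangian function

\[ L(x, z) = \|Ax - b\|^2 + z_1 (c_1^T x - d_1) + \cdots + z_p (c_p^T x - d_p), \]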


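where z is the p-vector of Lagrange multipliers. The method of Lagrange multipliers
tells us that if $\hat x$ is a solution of the constrained least squares problem, then there
is a set of Lagrange multipliers $\hat z$ that satisfy

\[
\frac{\partial L}{\partial x_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, n, \qquad
\frac{\partial L}{\partial z_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, p.
\qquad (16.3)
\]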


These are the optimality conditions for the constrained least squares problem. Any
solution of the constrained least squares problem must satisfy them. We will now
see that the optimality conditions can be expressed as a set of linear equations.
The second set of equations in the optimality conditions can be written as

\[
\frac{\partial L}{\partial z_i}(\hat x, \hat z) = c_i^T \hat x - d_i = 0, \quad i = 1, \ldots, p,
\]

which states that $\hat x$ satisfies the equality constraints $C\hat x = d$ (which we already
knew). The first set of equations, however, is more informative. Expanding the
objective $\|Ax - b\|^2$ as a sum of terms involving the entries of x (as was done on
page 229) and taking the partial derivative of L with respect to $x_i$ we obtain

\[
\frac{\partial L}{\partial x_i}(\hat x, \hat z)
= 2 \sum_{j=1}^{n} (A^TA)_{ij} \hat x_j - 2 (A^Tb)_i + \sum_{j=1}^{p} \hat z_j (c_j)_i = 0.
\]

These equations can be written in compact matrix-vector form as

\[ 2(A^TA)\hat x - 2A^Tb + C^T\hat z = 0. \]


Combining this set of linear equations with the feasibility conditions $C\hat x = d$, we
can write the optimality conditions (16.3) as one set of n + p linear equations in
the variables $(\hat x, \hat z)$:

\[
\left[ \begin{array}{cc} 2A^TA & C^T \\ C & 0 \end{array} \right]
\left[ \begin{array}{c} \hat x \\ \hat z \end{array} \right]
= \left[ \begin{array}{c} 2A^Tb \\ d \end{array} \right].
\qquad (16.4)
\]

These equations are called the KKT equations for the constrained least squares
problem. (KKT are the initials of the last names of William Karush, Harold
Kuhn, and Albert Tucker, the three researchers who derived the optimality
conditions for a more general form of constrained optimization problem.) The KKT
equations (16.4) are an extension of the normal equations (12.4) for a least squares
problem with no constraints. So we have reduced the constrained least squares
problem to the problem of solving a (square) set of n + p linear equations in n + p
variables $(\hat x, \hat z)$.


Invertibility of KKT matrix. The $(n + p) \times (n + p)$ coefficient matrix in (16.4) is
called the KKT matrix. It is invertible if and only if

\[
C \mbox{ has linearly independent rows, and }
\left[ \begin{array}{c} A \\ C \end{array} \right]
\mbox{ has linearly independent columns.}
\qquad (16.5)
\]

The first condition requires that C is wide (or square), i.e., that there are no more
constraints than variables. The second condition depends on both A and C, and
it can be satisfied even when the columns of A are linearly dependent. The
condition (16.5) is the generalization of our assumption (12.2) for unconstrained least
squares (i.e., that A has linearly independent columns).
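The condition (16.5) can be tested numerically with rank computations. The sketch below uses a tiny hand-picked example in which the columns of A are linearly dependent, to illustrate that (16.5) can still hold. The helper names `condition_16_5` and `kkt_matrix` are hypothetical, and `np.linalg.matrix_rank` decides rank with a numerical tolerance, so this is a check in floating point, not a proof.

```python
import numpy as np

def condition_16_5(A, C):
    """Check (16.5): rows of C independent and columns of [A; C] independent."""
    p, n = C.shape
    rows_ok = np.linalg.matrix_rank(C) == p
    cols_ok = np.linalg.matrix_rank(np.vstack([A, C])) == n
    return bool(rows_ok and cols_ok)

def kkt_matrix(A, C):
    p = C.shape[0]
    return np.block([[2 * A.T @ A, C.T], [C, np.zeros((p, p))]])

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])          # columns of A are linearly dependent
C_good = np.array([[1.0, -1.0]])    # stacked matrix [A; C_good] has rank 2
C_bad = np.array([[1.0, 1.0]])      # stacked matrix [A; C_bad] has rank 1

for C in (C_good, C_bad):
    K = kkt_matrix(A, C)
    print(condition_16_5(A, C), np.linalg.matrix_rank(K) == K.shape[0])
```

In both cases the two printed values agree, as the equivalence stated above predicts.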

Before proceeding, let us verify that the KKT matrix is invertible if and only
if (16.5) holds. First suppose that (16.5) holds but the KKT matrix is not invertible.
This means that there is a nonzero vector $(\bar x, \bar z)$ with

\[
\left[ \begin{array}{cc} 2A^TA & C^T \\ C & 0 \end{array} \right]
\left[ \begin{array}{c} \bar x \\ \bar z \end{array} \right] = 0.
\]

Multiply the top block equation $2A^TA\bar x + C^T\bar z = 0$ on the left by $\bar x^T$ to get

\[ 2\|A\bar x\|^2 + \bar x^T C^T \bar z = 0. \]

The second block equation, $C\bar x = 0$, implies (by taking the transpose) $\bar x^T C^T = 0$,
so the equation above becomes $2\|A\bar x\|^2 = 0$, i.e., $A\bar x = 0$. We also have $C\bar x = 0$, so

\[ \left[ \begin{array}{c} A \\ C \end{array} \right] \bar x = 0. \]

Since the matrix on the left has linearly independent columns (by assumption),
we conclude that $\bar x = 0$. The first block equation above then becomes $C^T\bar z = 0$.
But by our assumption that the columns of $C^T$ are linearly independent, we have
$\bar z = 0$. So $(\bar x, \bar z) = 0$, which is a contradiction.

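The converse is also true. First suppose that the rows of C are linearly
dependent. Then there is a nonzero vector $\bar z$ with $C^T\bar z = 0$. Then

\[
\left[ \begin{array}{cc} 2A^TA & C^T \\ C & 0 \end{array} \right]
\left[ \begin{array}{c} 0 \\ \bar z \end{array} \right] = 0,
\]

which shows the KKT matrix is not invertible. Now suppose that the stacked
matrix in (16.5) has linearly dependent columns, which means there is a nonzero
vector $\bar x$ for which

\[ \left[ \begin{array}{c} A \\ C \end{array} \right] \bar x = 0. \]

Direct calculation shows that

\[
\left[ \begin{array}{cc} 2A^TA & C^T \\ C & 0 \end{array} \right]
\left[ \begin{array}{c} \bar x \\ 0 \end{array} \right] = 0,
\]

which shows that the KKT matrix is not invertible.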
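When the conditions (16.5) hold, the constrained least squares problem (16.1)
has the (unique) solution $\hat x$, given by

\[
\left[ \begin{array}{c} \hat x \\ \hat z \end{array} \right]
= \left[ \begin{array}{cc} 2A^TA & C^T \\ C & 0 \end{array} \right]^{-1}
\left[ \begin{array}{c} 2A^Tb \\ d \end{array} \right].
\qquad (16.6)
\]

(This formula also gives us $\hat z$, the set of Lagrange multipliers.) From (16.6), we
observe that the solution $\hat x$ is a linear function of (b, d).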


Direct verification of constrained least squares solution. We will now show
directly, without using calculus, that the solution $\hat x$ given in (16.6) is the unique
vector that minimizes $\|Ax - b\|^2$ over all x that satisfy the constraints Cx = d,
when the conditions (16.5) hold. Let $\hat x$ and $\hat z$ denote the vectors given in (16.6), so
they satisfy

\[ 2A^TA\hat x + C^T\hat z = 2A^Tb, \qquad C\hat x = d. \]

Suppose that $x \neq \hat x$ is any vector that satisfies Cx = d. We will show that
$\|Ax - b\|^2 > \|A\hat x - b\|^2$.
We proceed in the same way as for the least squares problem:

\[
\begin{array}{rcl}
\|Ax - b\|^2 & = & \|(Ax - A\hat x) + (A\hat x - b)\|^2 \\
& = & \|Ax - A\hat x\|^2 + \|A\hat x - b\|^2 + 2(Ax - A\hat x)^T(A\hat x - b).
\end{array}
\]
Now we expand the last term:

\[
\begin{array}{rcl}
2(Ax - A\hat x)^T(A\hat x - b) & = & 2(x - \hat x)^T A^T (A\hat x - b) \\
& = & -(x - \hat x)^T C^T \hat z \\
& = & -(C(x - \hat x))^T \hat z \\
& = & 0,
\end{array}
\]

where we use $2A^T(A\hat x - b) = -C^T\hat z$ in the second line and $Cx = C\hat x = d$ in the
last line. So we have, exactly as in the case of unconstrained least squares,

\[ \|Ax - b\|^2 = \|A(x - \hat x)\|^2 + \|A\hat x - b\|^2, \]

from which we conclude that $\|Ax - b\|^2 \ge \|A\hat x - b\|^2$. So $\hat x$ minimizes $\|Ax - b\|^2$
subject to Cx = d.

It remains to show that for $x \neq \hat x$, we have the strict inequality $\|Ax - b\|^2 >
\|A\hat x - b\|^2$, which by the equation above is equivalent to $\|A(x - \hat x)\|^2 > 0$. If this
is not the case, then $A(x - \hat x) = 0$. We also have $C(x - \hat x) = 0$, and so

\[ \left[ \begin{array}{c} A \\ C \end{array} \right] (x - \hat x) = 0. \]

By our assumption that the matrix on the left has linearly independent columns,
we conclude that $x = \hat x$.


16.3 Solving constrained least squares problems


We can compute the solution (16.6) of the constrained least squares problem by 
forming and solving the KKT equations (16.4). 


Algorithm 16.1 CONSTRAINED LEAST SQUARES VIA KKT EQUATIONS

given an $m \times n$ matrix A and a $p \times n$ matrix C that satisfy (16.5), an m-vector b,
and a p-vector d.

1. Form Gram matrix. Compute $A^TA$.

2. Solve KKT equations. Solve KKT equations (16.4) by QR factorization and
back substitution.
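Algorithm 16.1 can be sketched as follows. Two things here are assumptions rather than the book's method: the function name `cls_solve_kkt`, and the use of `np.linalg.solve`, which relies on an LU factorization instead of the QR factorization named in step 2. The random problem sizes are only for illustration.

```python
import numpy as np

def cls_solve_kkt(A, b, C, d):
    """Solve minimize ||Ax-b||^2 subject to Cx=d via the KKT equations (16.4)."""
    n, p = A.shape[1], C.shape[0]
    G = A.T @ A                                      # step 1: Gram matrix
    KKT = np.block([[2 * G, C.T], [C, np.zeros((p, p))]])
    rhs = np.concatenate([2 * A.T @ b, d])
    sol = np.linalg.solve(KKT, rhs)                  # step 2 (LU-based here, not QR)
    return sol[:n], sol[n:]                          # x_hat and the multipliers z_hat

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
C, d = rng.standard_normal((5, 10)), rng.standard_normal(5)
x_hat, z_hat = cls_solve_kkt(A, b, C, d)

# Another feasible point: perturb x_hat along the nullspace of C.
v = rng.standard_normal(10)
x_other = x_hat + (v - np.linalg.pinv(C) @ (C @ v))

obj = lambda x: np.linalg.norm(A @ x - b) ** 2
print(np.linalg.norm(C @ x_hat - d), obj(x_hat), obj(x_other))
```

The feasible comparison point should never achieve a smaller objective than `x_hat`, in line with the direct verification in §16.2.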


The second step cannot fail, provided the assumption (16.5) holds. Let us
analyze the complexity of this algorithm. The first step, forming the Gram matrix,
requires $mn^2$ flops (see page 182). The second step requires the solution of a square
system of n + p equations, which costs $2(n + p)^3$ flops, so the total is

\[ mn^2 + 2(n + p)^3 \]

flops. This grows linearly in m and cubically in n and p. The assumption (16.5)
implies $p \le n$, so in terms of order, $(n + p)^3$ can be replaced with $n^3$.
Solving constrained least squares problems via QR factorization. We now give
a method for solving the constrained least squares problem that generalizes the QR
factorization method for least squares problems (algorithm 12.1). We assume that
A and C satisfy the conditions (16.5).

We start by rewriting the KKT equations (16.4) as

\[ 2(A^TA + C^TC)\hat x + C^Tw = 2A^Tb, \qquad C\hat x = d \qquad (16.7) \]

with a new variable $w = \hat z - 2d$. To obtain (16.7) we multiplied the equation
$C\hat x = d$ on the left by $2C^T$, then added the result to the first equation of (16.4),
and replaced the variable $\hat z$ with $w + 2d$.


Next we use the QR factorization

\[
\left[ \begin{array}{c} A \\ C \end{array} \right] = QR
= \left[ \begin{array}{c} Q_1 \\ Q_2 \end{array} \right] R
\qquad (16.8)
\]

to simplify (16.7). This factorization exists because the stacked matrix has linearly
independent columns, by our assumption (16.5). In (16.8) we also partition Q in
two blocks $Q_1$ and $Q_2$, of size $m \times n$ and $p \times n$, respectively. If we make the
substitutions $A = Q_1R$, $C = Q_2R$, and $A^TA + C^TC = R^TR$ in (16.7) we obtain

\[ 2R^TR\hat x + R^TQ_2^Tw = 2R^TQ_1^Tb, \qquad Q_2R\hat x = d. \]

We multiply the first equation on the left by $R^{-T}$ (which we know exists) to get

\[ R\hat x = Q_1^Tb - (1/2)Q_2^Tw. \qquad (16.9) \]

Substituting this expression into $Q_2R\hat x = d$ gives an equation in w:

\[ Q_2Q_2^Tw = 2Q_2Q_1^Tb - 2d. \qquad (16.10) \]


We now use the first part of the assumption (16.5), that C has linearly independent
rows, to show that the matrix $Q_2^T = R^{-T}C^T$ has linearly independent columns.
Suppose $Q_2^Tz = R^{-T}C^Tz = 0$. Multiplying with $R^T$ gives $C^Tz = 0$. Since C has
linearly independent rows, this implies z = 0, and we conclude that the columns of
$Q_2^T$ are linearly independent.

The matrix $Q_2^T$ therefore has a QR factorization $Q_2^T = \tilde Q\tilde R$. Substituting this
into (16.10) gives

\[ \tilde R^T\tilde R w = 2\tilde R^T\tilde Q^TQ_1^Tb - 2d, \]

which we can write as

\[ \tilde R w = 2\tilde Q^TQ_1^Tb - 2\tilde R^{-T}d. \]

We can use this to compute w, first by computing $\tilde R^{-T}d$ (by forward substitution),
then forming the right-hand side, and then solving for w using back substitution.
Once we know w, we can find $\hat x$ from (16.9). The method is summarized in the
following algorithm.
Algorithm 16.2 CONSTRAINED LEAST SQUARES VIA QR FACTORIZATION

given an $m \times n$ matrix A and a $p \times n$ matrix C that satisfy (16.5), an m-vector b,
and a p-vector d.

1. QR factorizations. Compute the QR factorizations

\[
\left[ \begin{array}{c} A \\ C \end{array} \right]
= \left[ \begin{array}{c} Q_1 \\ Q_2 \end{array} \right] R, \qquad
Q_2^T = \tilde Q\tilde R.
\]

2. Compute $\tilde R^{-T}d$ by forward substitution.

3. Form right-hand side and solve

\[ \tilde R w = 2\tilde Q^TQ_1^Tb - 2\tilde R^{-T}d \]

via back substitution.

4. Compute $\hat x$. Form right-hand side and solve

\[ R\hat x = Q_1^Tb - (1/2)Q_2^Tw \]

by back substitution.
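The four steps of algorithm 16.2 translate almost line by line into NumPy. This is a sketch under stated assumptions: the function name `cls_solve_qr` is hypothetical, and the triangular systems are solved with the general-purpose `np.linalg.solve` for brevity, which does not exploit the triangular structure the way forward and back substitution do.

```python
import numpy as np

def cls_solve_qr(A, b, C, d):
    """Constrained least squares following the steps of algorithm 16.2."""
    m = A.shape[0]
    # Step 1: QR factorizations of the stacked matrix and of Q2^T.
    Q, R = np.linalg.qr(np.vstack([A, C]))    # Q is (m+p) x n, R is n x n
    Q1, Q2 = Q[:m], Q[m:]
    Qt, Rt = np.linalg.qr(Q2.T)               # Qt is n x p, Rt is p x p
    # Step 2: u = Rt^{-T} d.
    u = np.linalg.solve(Rt.T, d)
    # Step 3: solve Rt w = 2 Qt^T Q1^T b - 2 u.
    w = np.linalg.solve(Rt, 2 * Qt.T @ (Q1.T @ b) - 2 * u)
    # Step 4: solve R x = Q1^T b - (1/2) Q2^T w.
    return np.linalg.solve(R, Q1.T @ b - 0.5 * Q2.T @ w)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
C, d = rng.standard_normal((5, 10)), rng.standard_normal(5)
x_hat = cls_solve_qr(A, b, C, d)
print(np.linalg.norm(C @ x_hat - d))   # constraint residual, should be tiny
```

`np.linalg.qr` may return an R with negative diagonal entries; the derivation above only needs Q to have orthonormal columns and R to be invertible and upper triangular, so this does not affect the result.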


In the unconstrained case (when p = 0), step 1 reduces to computing the QR
factorization of A, steps 2 and 3 are not needed, and step 4 reduces to solving
$R\hat x = Q^Tb$. This is the same as algorithm 12.1 for solving (unconstrained) least
squares problems.

We now give a complexity analysis. Step 1 involves the QR factorizations of
an $(m + p) \times n$ and an $n \times p$ matrix, which costs $2(m + p)n^2 + 2np^2$ flops. Step 2
requires $p^2$ flops. In step 3, we first evaluate $Q_1^Tb$ (2mn flops), multiply the result
by $\tilde Q^T$ (2pn flops), and then solve for w using back substitution ($p^2$ flops). Step 4
requires 2mn + 2pn flops to form the right-hand side, and $n^2$ flops to compute $\hat x$ via
back substitution. The costs of steps 2, 3, and 4 are quadratic in the dimensions,
and so are negligible compared to the cost of step 1, so our final complexity is

\[ 2(m + p)n^2 + 2np^2 \]

flops. The assumption (16.5) implies the inequalities

\[ p \le n \le m + p, \]

and therefore $(m + p)n^2 \ge np^2$. So the flop count above is no more than $4(m + p)n^2$
flops. In particular, its order is $(m + p)n^2$.


Sparse constrained least squares. Constrained least squares problems with sparse
matrices A and C arise in many applications; we will see several examples in the
next chapter. Just as for solving linear equations, or (unconstrained) least squares
problems, there are methods that exploit the sparsity in A and C to solve
constrained least squares problems more efficiently than the generic algorithms 16.1
or 16.2. The simplest such methods follow these basic algorithms, replacing the
QR factorizations with sparse QR factorizations (see page 190).

One potential problem with forming the KKT matrix as in algorithm 16.1 is
that the Gram matrix $A^TA$ can be far less sparse than the matrix A. This problem
can be avoided using a trick analogous to the one used on page 232 to solve sparse
(unconstrained) least squares problems. We form the square set of m + n + p linear
equations

\[
\left[ \begin{array}{ccc}
0 & A^T & C^T \\
A & -(1/2)I & 0 \\
C & 0 & 0
\end{array} \right]
\left[ \begin{array}{c} \hat x \\ \hat y \\ \hat z \end{array} \right]
= \left[ \begin{array}{c} 0 \\ b \\ d \end{array} \right].
\qquad (16.11)
\]

If $(\hat x, \hat y, \hat z)$ satisfies these equations, it is easy to see that $(\hat x, \hat z)$ satisfies the KKT
equations (16.4); conversely, if $(\hat x, \hat z)$ satisfies the KKT equations (16.4), $(\hat x, \hat y, \hat z)$
satisfies the equations above, with $\hat y = 2(A\hat x - b)$. Provided A and C are sparse,
the coefficient matrix above is sparse, and any method for solving a sparse system
of linear equations can be used to solve it.
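The equivalence between (16.11) and the KKT equations can be checked on a small example. The sketch below uses dense NumPy arrays purely for illustration (in a real sparse setting one would hand the same block matrix to a sparse solver); the function name `cls_solve_extended` and the random data are assumptions.

```python
import numpy as np

def cls_solve_extended(A, b, C, d):
    """Solve the (m+n+p) x (m+n+p) system (16.11); dense here for illustration."""
    m, n = A.shape
    p = C.shape[0]
    M = np.block([
        [np.zeros((n, n)), A.T,               C.T],
        [A,                -0.5 * np.eye(m),  np.zeros((m, p))],
        [C,                np.zeros((p, m)),  np.zeros((p, p))],
    ])
    rhs = np.concatenate([np.zeros(n), b, d])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n:n + m], sol[n + m:]     # x_hat, y_hat, z_hat

rng = np.random.default_rng(0)
A, b = rng.standard_normal((12, 6)), rng.standard_normal(12)
C, d = rng.standard_normal((2, 6)), rng.standard_normal(2)
x_hat, y_hat, z_hat = cls_solve_extended(A, b, C, d)

# The middle block should equal 2(A x_hat - b), as stated in the text.
print(np.linalg.norm(y_hat - 2 * (A @ x_hat - b)))
```

Note that the Gram matrix $A^TA$ never appears in the coefficient matrix, which is the point of the reformulation.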


Solution of least norm problem. Here we specialize the solution of the general 
constrained least squares problem (16.1) given above to the special case of the least 
norm problem (16.2). 

We start with the conditions (16.5). The stacked matrix is in this case

\[ \left[ \begin{array}{c} I \\ C \end{array} \right], \]

which always has linearly independent columns. So the conditions (16.5) reduce
to: C has linearly independent rows. We make this assumption now.
For the least norm problem, the KKT equations (16.4) reduce to

\[
\left[ \begin{array}{cc} 2I & C^T \\ C & 0 \end{array} \right]
\left[ \begin{array}{c} \hat x \\ \hat z \end{array} \right]
= \left[ \begin{array}{c} 0 \\ d \end{array} \right].
\]

We can solve this using the methods for general constrained least squares, or derive
the solution directly, which we do now. The first block row of this equation is
$2\hat x + C^T\hat z = 0$, so

\[ \hat x = -(1/2)C^T\hat z. \]

We substitute this into the second block equation, $C\hat x = d$, to obtain

\[ -(1/2)CC^T\hat z = d. \]

Since the rows of C are linearly independent, $CC^T$ is invertible, so we have

\[ \hat z = -2(CC^T)^{-1}d. \]

Substituting this expression for $\hat z$ into the formula for $\hat x$ above gives

\[ \hat x = C^T(CC^T)^{-1}d. \qquad (16.12) \]
We have seen the matrix in this formula before: It is the pseudo-inverse of a wide
matrix with linearly independent rows. So we can express the solution of the least
norm problem (16.2) in the very compact form

\[ \hat x = C^\dagger d. \]

In §11.5, we saw that $C^\dagger$ is a right inverse of C; here we see that not only does
$\hat x = C^\dagger d$ satisfy Cx = d, but it gives the vector of least norm that satisfies Cx = d.

In §11.5, we also saw that the pseudo-inverse of C can be expressed as $C^\dagger =
QR^{-T}$, where $C^T = QR$ is the QR factorization of $C^T$. The solution of the least
norm problem can therefore be expressed as

\[ \hat x = QR^{-T}d \]

and this leads to an algorithm for solving the least norm problem via the QR
factorization.


Algorithm 16.3 LEAST NORM VIA QR FACTORIZATION

given a $p \times n$ matrix C with linearly independent rows and a p-vector d.

1. QR factorization. Compute the QR factorization $C^T = QR$.
2. Compute $\hat x$. Solve $R^Ty = d$ by forward substitution.
3. Compute $\hat x = Qy$.

The complexity of this algorithm is dominated by the cost of the QR
factorization in step 1, i.e., $2np^2$ flops.
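Algorithm 16.3 is short enough to sketch directly. The function name `least_norm_qr` and the random test data are assumptions, and `np.linalg.solve` stands in for forward substitution on the lower triangular matrix $R^T$.

```python
import numpy as np

def least_norm_qr(C, d):
    """Least norm solution of Cx = d, following the steps of algorithm 16.3."""
    Q, R = np.linalg.qr(C.T)        # step 1: C^T = QR, Q is n x p, R is p x p
    y = np.linalg.solve(R.T, d)     # step 2: solve R^T y = d
    return Q @ y                    # step 3: x_hat = Q y

rng = np.random.default_rng(0)
C, d = rng.standard_normal((3, 8)), rng.standard_normal(3)
x_ln = least_norm_qr(C, d)

# Any other solution of Cx = d differs from x_ln by a nullspace vector of C.
v = rng.standard_normal(8)
x_other = x_ln + (v - np.linalg.pinv(C) @ (C @ v))
print(np.linalg.norm(x_ln), np.linalg.norm(x_other))
```

The result should agree with `np.linalg.pinv(C) @ d`, and the other feasible point should have a norm at least as large.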
Exercises


Smallest right inverse. Suppose the $m \times n$ matrix A is wide, with linearly independent
rows. Its pseudo-inverse $A^\dagger$ is a right inverse of A. In fact, there are many right inverses
of A and it turns out that $A^\dagger$ is the smallest one among them, as measured by the matrix
norm. In other words, if X satisfies AX = I, then $\|X\| \ge \|A^\dagger\|$. You will show this in
this problem.

(a) Suppose AX = I, and let $x_1, \ldots, x_m$ denote the columns of X. Let $b_j$ denote the
jth column of $A^\dagger$. Explain why $\|x_j\|^2 \ge \|b_j\|^2$. Hint. Show that $z = b_j$ is the vector
of smallest norm that satisfies $Az = e_j$, for $j = 1, \ldots, m$.

(b) Use the inequalities from part (a) to establish $\|X\| \ge \|A^\dagger\|$.


Matrix least norm problem. The matrix least norm problem is

\[
\begin{array}{ll}
\mbox{minimize} & \|X\|^2 \\
\mbox{subject to} & CX = D,
\end{array}
\]

where the variable to be chosen is the $n \times k$ matrix X; the $p \times n$ matrix C and the
$p \times k$ matrix D are given. Show that the solution of this problem is $\hat X = C^\dagger D$, assuming
the rows of C are linearly independent. Hint. Show that we can find the columns of X
independently, by solving a least norm problem for each one.


Closest solution to a given point. Suppose the wide matrix A has linearly independent
rows. Find an expression for the point x that is closest to a given vector y (i.e., minimizes
$\|x - y\|^2$) among all vectors that satisfy Ax = b.


Remark. This problem comes up when x is some set of inputs to be found, Ax = b
represents some set of requirements, and y is some nominal value of the inputs. For
example, when the inputs represent actions that are re-calculated each day (say, because b
changes every day), y might be yesterday's action, and today's action x found as above
gives the least change from yesterday's action, subject to meeting today's requirements.


Nearest vector with a given average. Let a be an n-vector and $\beta$ a scalar. How would you
find the n-vector x that is closest to a among all n-vectors that have average value $\beta$?
Give a formula for x and describe it in English.


Checking constrained least squares solution. Generate a random 20 x 10 matrix A and a 
random 5x 10 matrix C. Then generate random vectors b and d of appropriate dimensions 
for the constrained least squares problem 


\[
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|^2 \\
\mbox{subject to} & Cx = d.
\end{array}
\]

Compute the solution $\hat x$ by forming and solving the KKT equations. Verify that the
constraints very nearly hold, i.e., $C\hat x - d$ is very small. Find the least norm solution
$x^{\mathrm{ln}}$ of Cx = d. The vector $x^{\mathrm{ln}}$ also satisfies Cx = d (very nearly). Verify that
$\|Ax^{\mathrm{ln}} - b\|^2 > \|A\hat x - b\|^2$.


Modifying a diet to meet nutrient requirements. (Continuation of exercise 8.9.) The
current daily diet is specified by the n-vector $d^{\mathrm{curr}}$. Explain how to find the closest diet
$d^{\mathrm{mod}}$ to $d^{\mathrm{curr}}$ that satisfies the nutrient requirements given by the m-vector $n^{\mathrm{des}}$, and has
the same cost as the current diet $d^{\mathrm{curr}}$.


Minimum cost trading to achieve target sector exposures. A current portfolio is given
by the n-vector $h^{\mathrm{curr}}$, with the entries giving the dollar value invested in the n assets.
The total value (or net asset value) of the portfolio is $\mathbf{1}^Th^{\mathrm{curr}}$. We seek a new portfolio,
given by the n-vector h, with the same total value as $h^{\mathrm{curr}}$. The difference $h - h^{\mathrm{curr}}$ is
called the trade vector; it gives the amount of each asset (in dollars) that we buy or sell.
The n assets are divided into m industry sectors, such as pharmaceuticals or consumer
electronics. We let the m-vector s denote the (dollar value) sector exposures to the m
sectors. (See exercise 8.13.) These are given by s = Sh, where S is the $m \times n$ sector
exposure matrix defined by $S_{ij} = 1$ if asset j is in sector i and $S_{ij} = 0$ if asset j is not
in sector i. The new portfolio must have a given sector exposure $s^{\mathrm{des}}$. (The given sector
exposures are based on forecasts of whether companies in the different sectors will do well
or poorly in the future.)


Among all portfolios that have the same value as our current portfolio and achieve the
desired exposures, we wish to minimize the trading cost, given by

\[ \sum_{i=1}^{n} \kappa_i (h_i - h_i^{\mathrm{curr}})^2, \]

a weighted sum of squares of the asset trades. The weights $\kappa_i$ are positive. (These depend
on the daily trading volumes of assets, as well as other quantities. In general, it is cheaper
to trade assets that have high trading volumes.)


Explain how to find h using constrained least squares. Give the KKT equations that you
would solve to find h.


Minimum energy regulator. We consider a linear dynamical system with dynamics
$x_{t+1} = Ax_t + Bu_t$, where the n-vector $x_t$ is the state at time t and the m-vector $u_t$ is the
input at time t. We assume that x = 0 represents the desired operating point; the goal is to
find an input sequence $u_1, \ldots, u_{T-1}$ that results in $x_T = 0$, given the initial state $x_1$.
Choosing an input sequence that takes the state to the desired operating point at time T
is called regulation.

Find an explicit formula for the sequence of inputs that yields regulation, and minimizes
$\|u_1\|^2 + \cdots + \|u_{T-1}\|^2$, in terms of A, B, T, and $x_1$. This sequence of inputs is called the
minimum energy regulator.


Hint. Express $x_T$ in terms of $x_1$, A, the controllability matrix

\[ C = \left[ \begin{array}{ccccc} A^{T-2}B & A^{T-3}B & \cdots & AB & B \end{array} \right], \]

and $(u_1, u_2, \ldots, u_{T-1})$ (which is the input sequence stacked). You may assume that C is
wide and has linearly independent rows.


Smoothest force sequence to move a mass. We consider the same setup as the example 
given on page 343, where the 10-vector f represents a sequence of forces applied to a unit 
mass over 10 1-second intervals. As in the example, we wish to find a force sequence f 
that achieves zero final velocity and final position one. In the example on page 343, we 
choose the smallest f, as measured by its norm (squared). Here, though, we want the 
smoothest force sequence, i.e., the one that minimizes 


\[ f_1^2 + (f_2 - f_1)^2 + \cdots + (f_{10} - f_9)^2 + f_{10}^2. \]

(This is the sum of the squares of the differences, assuming that $f_0 = 0$ and $f_{11} = 0$.)
Explain how to find this force sequence. Plot it, and give a brief comparison with the 
force sequence found in the example on page 343. 


Smallest force sequence to move a mass to a given position. We consider the same setup 
as the example given on page 343, where the 10-vector f represents a sequence of forces 
applied to a unit mass over 10 1-second intervals. In that example the goal is to find 
the smallest force sequence (measured by $\|f\|^2$) that achieves zero final velocity and
final position one. Here we ask, what is the smallest force sequence that achieves final 
position one? (We impose no condition on the final velocity.) Explain how to find this 
force sequence. Compare it to the force sequence found in the example, and give a brief 
intuitive explanation of the difference. Remark. Problems in which the final position of 
an object is specified, but the final velocity doesn’t matter, generally arise in applications 
that are not socially positive, for example control of missiles. 

Least distance problem. A variation on the least norm problem (16.2) is the least distance 
problem, 

\[
\begin{array}{ll}
\mbox{minimize} & \|x - a\|^2 \\
\mbox{subject to} & Cx = d,
\end{array}
\]

where the n-vector x is to be determined, the n-vector a is given, the $p \times n$ matrix C is
given, and the p-vector d is given. Show that the solution of this problem is

\[ \hat x = a - C^\dagger(Ca - d), \]


assuming the rows of C are linearly independent. Hint. You can argue directly from the 
KKT equations for the least distance problem, or solve for the variable y = x — a instead 
of x. 


Least norm polynomial interpolation. (Continuation of exercise 8.7.) Find the polynomial 
of degree 4 that satisfies the interpolation conditions given in exercise 8.7, and minimizes 
the sum of the squares of its coefficients. Plot it, to verify that it satisfies the interpolation
conditions. 


Steganography via least norm. In steganography, a secret message is embedded in an 
image in such a way that the image looks the same, but an accomplice can decode the 
message. In this exercise we explore a simple approach to steganography that relies on 
constrained least squares. The secret message is given by a k-vector s with entries that 
are all either +1 or —1 (i.e., it is a Boolean vector). The original image is given by the 
n-vector x, where n is usually much larger than k. We send (or publish or transmit) the 
modified message x + z, where z is an n-vector of modifications. We would like z to be 
small, so that the original image x and the modified one «+z look (almost) the same. Our 
accomplice decodes the message s by multiplying the modified image by a k x n matrix 
D, which yields the k-vector y = D(x + z). The message is then decoded as § = sign(y). 
(We write § to show that it is an estimate, and might not be the same as the original.) 
The matrix D must have linearly independent rows, but otherwise is arbitrary. 


a) Encoding via least norm. Let a be a positive constant. We choose z to minimize 
g 
\|z||? subject to D(x+z) = as. (This guarantees that the decoded message is correct, 
i.e., § = s.) Give a formula for z in terms of DÝ, a, and z. 


(b) Complexity. What is the complexity of encoding a secret message in an image? 
(You can assume that DÌ is already computed and saved.) What is the complexity 
of decoding the secret message? About how long would each of these take with a 
computer capable of carrying out 1 Gflop/s, for k = 128 and n = 512? = 262144 (a 
512 x 512 image)? 


(c) Try it out. Choose an image x, with entries between 0 (black) and 1 (white), 
and a secret message s with k small compared to n, for example, k = 128 for a 
512 x 512 image. (This corresponds to 16 bytes, which can encode 16 characters, 
i.e., letters, numbers, or punctuation marks.) Choose the entries of D randomly, and 
compute D'. The modified image x+z may have entries outside the range [0, 1]. We 
replace any negative values in the modified image with zero, and any values greater 
than one with one. Adjust a until the original and modified images look the same, 
but the secret message is still decoded correctly. (If œ is too small, the clipping of 
the modified image values, or the round-off errors that occur in the computations, 
can lead to decoding error, i.e., § Æ s. If a is too large, the modification will be 
visually apparent.) Once you’ve chosen a, send several different secret messages 
embedded in several different original images. 


Invertibility of matrix in sparse constrained least squares formulation. Show that the 
(m+n+p) x (m+n + p) coefficient matrix appearing in equation (16.11) is invertible if 
and only if the KKT matrix is invertible, i.e., the conditions (16.5) hold. 


16.15 Approximating each column of a matrix as a linear combination of the others. Suppose A 
is an m x n matrix with linearly independent columns $a_1, \ldots, a_n$. For each i we consider 
the problem of finding the linear combination of $a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n$ that is closest 
to $a_i$. These are n standard least squares problems, which can be solved using the methods 
of chapter 12. In this exercise we explore a simple formula that allows us to solve these n 
least squares problems all at once. Let $G = A^T A$ denote the Gram matrix, and $H = G^{-1}$ 
its inverse, with columns $h_1, \ldots, h_n$. 

(a) Explain why minimizing $\|Ax^{(i)}\|^2$ subject to $x^{(i)}_i = -1$ solves the problem of finding 
the linear combination of $a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n$ that is closest to $a_i$. These are n 
constrained least squares problems. 

(b) Solve the KKT equations for these constrained least squares problems, 

\[
\begin{bmatrix} 2A^TA & e_i \\ e_i^T & 0 \end{bmatrix}
\begin{bmatrix} x^{(i)} \\ z_i \end{bmatrix} =
\begin{bmatrix} 0 \\ -1 \end{bmatrix},
\]

to conclude that $x^{(i)} = -(1/H_{ii}) h_i$. In words: $x^{(i)}$ is the ith column of $(A^TA)^{-1}$, 
scaled so its ith entry is -1. 

(c) Each of the n original least squares problems has n - 1 variables, so the complexity 
is $n(2m(n-1)^2)$ flops, which we can approximate as $2mn^3$ flops. Compare this 
to the complexity of a method based on the result of part (b): First find the QR 
factorization of A; then compute H. 

(d) Let $d_i$ denote the distance between $a_i$ and the linear combination of the other 
columns that is closest to it. Show that $d_i = 1/\sqrt{H_{ii}}$. 

Remark. When the matrix A is a data matrix, with $A_{ij}$ the value of the jth feature on the 
ith example, the problem addressed here is the problem of predicting each of the features 
from the others. The numbers $d_i$ tell us how well each feature can be predicted from the 
others. 


Chapter 17 


Constrained least squares applications 


In this chapter we discuss several applications of equality constrained least squares. 


17.1 Portfolio optimization 


In portfolio optimization (also known as portfolio selection), we invest in different 
assets, typically stocks, over some investment periods. The goal is to make 
investments so that the combined return on all our investments is consistently high. (We 
must accept the idea that for our average return to be high, we must tolerate some 
variation in the return, i.e., some risk.) The idea of optimizing a portfolio of assets 
was proposed in 1953 by Harry Markowitz, who won the Nobel prize in economics 
for this work in 1990. In this section we will show that a version of this problem 
can be formulated and solved as a linearly constrained least squares problem. 


17.1.1 Portfolio risk and return 


Portfolio allocation weights. We allocate a total amount of money to be invested 
in n different assets. The allocation across the n assets is described by an allocation 
n-vector w, which satisfies $\mathbf{1}^T w = 1$, i.e., its entries sum to one. If a total (dollar) 
amount V is to be invested in some period, then $V w_j$ is the amount invested in 
asset j. (This can be negative, meaning a short position of $|V w_j|$ dollars on asset j.) 
The entries of w are called by various names including fractional allocations, asset 
weights, asset allocations, or just weights. 

For example, the asset allocation $w = e_j$ means that we invest everything in 
asset j. (In this way, we can think of the individual assets as simple portfolios.) 
The asset allocation w = (-0.2, 0.0, 1.2) means that we take a short position in 
asset 1 of one fifth of the total amount invested, and put the cash derived from the 
short position plus our initial amount to be invested into asset 3. We do not invest 
in asset 2 at all. 
The leverage L of the portfolio is given by 

\[
L = |w_1| + \cdots + |w_n|,
\]


the sum of the absolute values of the weights. If all entries of w are nonnegative 
(which is called a long-only portfolio), we have L = 1; if some entries are negative, 
then L > 1. If a portfolio has a leverage of 5, it means that for every $1 of portfolio 
value, we have $3 of total long holdings, and $2 of total short holdings. (Other 
definitions of leverage are used, for example, (L — 1)/2.) 
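
As a small illustration of the leverage definition, the following sketch computes L for the allocation w = (-0.2, 0.0, 1.2) used above, and splits it into total long and short holdings per dollar of portfolio value. The function name `leverage` is our own choice for illustration, not notation from the text.

```python
import numpy as np

def leverage(w):
    """Sum of the absolute values of the allocation weights."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.abs(w)))

w = np.array([-0.2, 0.0, 1.2])           # allocation from the text; entries sum to one
L = leverage(w)                           # about 1.4
long_holdings = float(w[w > 0].sum())     # total long position per $1 of value
short_holdings = float(-w[w < 0].sum())   # total short position per $1 of value

print(L, long_holdings, short_holdings)
```

Since the weights sum to one, the long holdings minus the short holdings always equal one, while their sum is the leverage.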


Multi-period investing with allocation weights. The investments are held for T 
periods of, say, one day each. (The periods could just as well be hours, weeks, or 
months.) We describe the investment returns by the T x n matrix R, where $R_{tj}$ 
is the fractional return of asset j in period t. Thus $R_{61} = 0.02$ means that asset 1 
gained 2% in period 6, and $R_{82} = -0.03$ means that asset 2 lost 3%, over period 8. 
The jth column of R is the return time series for asset j; the tth row of R gives 
the returns of all assets in period t. It is often assumed that one of the assets is 
cash, which has a constant (positive) return $\mu^{\mathrm{rf}}$, where the superscript stands for 
risk-free. If the risk-free asset is asset n, then the last column of R is $\mu^{\mathrm{rf}} \mathbf{1}$. 

Suppose we invest a total (positive) amount $V_t$ at the beginning of period t, 
so we invest $V_t w_j$ in asset j. At the end of period t, the dollar value of asset j is 
$V_t w_j (1 + R_{tj})$, and the dollar value of the whole portfolio is 

\[
V_{t+1} = \sum_{j=1}^n V_t w_j (1 + R_{tj}) = V_t (1 + \tilde r_t^T w),
\]

where $\tilde r_t^T$ is the tth row of R. We assume $V_{t+1}$ is positive; if the total portfolio 
value becomes negative we say that the portfolio has gone bust and stop trading. 

The total (fractional) return of the portfolio over period t, i.e., its fractional 
increase in value, is 

\[
\frac{V_{t+1} - V_t}{V_t} = \frac{V_t (1 + \tilde r_t^T w) - V_t}{V_t} = \tilde r_t^T w.
\]

Note that we invest the total portfolio value in each period according to the 
weights w. This entails buying and selling assets so that the dollar value frac- 
tions are once again given by w. This is called re-balancing the portfolio. 
The portfolio return in each of the T periods can be expressed compactly using 
matrix-vector notation as 
r= Rw, 


where r is the T-vector of portfolio returns in the T periods, i.e., the time series 
of portfolio returns. (Note that r is a T-vector, which represents the time series 
of total portfolio return, whereas $\tilde r_t$ is an n-vector, which gives the returns of the 
n assets in period t.) If asset n is risk-free, and we choose the allocation $w = e_n$, 
then $r = Re_n = \mu^{\mathrm{rf}} \mathbf{1}$, i.e., we obtain a constant return in each period of $\mu^{\mathrm{rf}}$. 

We can express the total portfolio value in period t as 

\[
V_t = V_1 (1 + r_1)(1 + r_2) \cdots (1 + r_{t-1}), \tag{17.1}
\]

where $V_1$ is the total amount initially invested in period t = 1. This total value time 
series is often plotted using $V_1 = \$10000$ as the initial investment by convention. 
The product in (17.1) arises from re-investing our total portfolio value (including 
any past gains or losses) in each period. In the simple case when the last asset is 
risk-free and we choose $w = e_n$, the total value grows as $V_t = V_1 (1 + \mu^{\mathrm{rf}})^{t-1}$. This 
is called compounded interest at rate $\mu^{\mathrm{rf}}$. 

When the returns $r_t$ are small (say, a few percent), and T is not too big (say, a 
few hundred), we can approximate the product above using the sum or average of 
the returns. To do this we expand the product in (17.1) into a sum of terms, each 
of which involves a product of some of the returns. One term involves none of the 
returns, and is $V_1$. There are t - 1 terms that involve just one return, which have 
the form $V_1 r_s$, for s = 1,...,t - 1. All other terms in the expanded product involve 
the product of at least two returns, and so can be neglected since we assume that 
the returns are small. This leads to the approximation 

\[
V_t \approx V_1 + V_1 (r_1 + \cdots + r_{t-1}),
\]

which for t = T + 1 can be written as 

\[
V_{T+1} \approx V_1 + T \,\mathrm{avg}(r)\, V_1.
\]


This approximation suggests that to maximize our total final portfolio value, we 
should seek high return, i.e., a large value for avg(r). 
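
The quality of this approximation can be checked numerically. The sketch below compares the exact product (17.1) with the first-order sum for a synthetic return series; the random returns (mean 0.05%, standard deviation 0.5% per period) and the variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 250                                   # number of periods (about one trading year)
r = rng.normal(0.0005, 0.005, size=T)     # synthetic per-period portfolio returns
V1 = 10000.0                              # initial investment

# Exact values V_1, ..., V_{T+1} from the product formula (17.1).
V_exact = V1 * np.concatenate(([1.0], np.cumprod(1.0 + r)))

# First-order approximation V_t ~ V_1 + V_1 (r_1 + ... + r_{t-1}).
V_approx = V1 * (1.0 + np.concatenate(([0.0], np.cumsum(r))))

rel_gap = abs(V_exact[-1] - V_approx[-1]) / V_exact[-1]
print(f"exact final value:       {V_exact[-1]:.2f}")
print(f"approximate final value: {V_approx[-1]:.2f}")
print(f"relative gap:            {rel_gap:.4f}")
```

The last entry of `V_approx` equals V_1 + T avg(r) V_1, which is the expression given above for t = T + 1.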


Portfolio return and risk. The choice of weight vector w is judged by the result- 
ing portfolio return time series r = Rw. The portfolio mean return (over the T 
periods), often shortened to just the return, is given by avg(r). The portfolio risk 
(over the T periods) is the standard deviation of portfolio return, std(r). 

The quantities avg(r) and std(r) give the per-period return and risk. They 
are often converted to their equivalent values for one year, which are called the 
annualized return and risk, and reported as percentages. If there are P periods in 
one year, these are given by 


\[
P\,\mathrm{avg}(r), \qquad \sqrt{P}\,\mathrm{std}(r),
\]

respectively. For example, suppose each period is one (trading) day. There are 
about 250 trading days in one year, so the annualized return and risk are given 
by 250 avg(r) and 15.81 std(r). Thus a daily return sequence r with per-period 
(daily) return 0.05% (0.0005) and risk 0.5% (0.005) has an annualized return and 
risk of 12.5% and 7.9%, respectively. (The square root of P in the risk annualization 
comes from the assumption that the fluctuations in the returns vary randomly and 
independently from period to period.) 
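
The annualization rule can be reproduced in a few lines. In this sketch the helper `annualize` and the artificial return series (built to have mean exactly 0.0005 and standard deviation 0.005) are our own illustrative choices; `np.std` divides by T, which matches the definition of std used in this book.

```python
import numpy as np

def annualize(r, periods_per_year=250):
    """Return (annualized return, annualized risk) of a per-period return series."""
    r = np.asarray(r, dtype=float)
    ann_return = periods_per_year * np.mean(r)
    ann_risk = np.sqrt(periods_per_year) * np.std(r)   # np.std divides by T
    return ann_return, ann_risk

# Artificial daily returns alternating around 0.05% with deviation 0.5%.
r = 0.0005 + 0.005 * np.tile([1.0, -1.0], 125)
ann_return, ann_risk = annualize(r)
print(f"annualized return: {100 * ann_return:.1f}%")
print(f"annualized risk:   {100 * ann_risk:.1f}%")
```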
17.1.2 Portfolio optimization 


We want to choose w so that we achieve high return and low risk. This means 
that we seek portfolio returns r that are consistently high. This is an optimization 
problem with two objectives, return and risk. Since there are two objectives, there 
is a family of solutions, that trade off return and risk. For example, when the 
last asset is risk-free, the portfolio weight $w = e_n$ achieves zero risk (which is the 
smallest possible value), and return $\mu^{\mathrm{rf}}$. We will see that other choices of w can 
lead to higher return, but higher risk as well. Portfolio weights that minimize risk 
for a given level of return (or maximize return for a given level of risk) are called 
Pareto optimal. The risk and return of this family of weights are typically plotted 
on a risk-return plot, with risk on the horizontal axis and return on the vertical 
axis. Individual assets can be considered (very simple) portfolios, corresponding 
to w = ej. In this case the corresponding portfolio return and risk are simply the 
return and risk of asset j (over the same T periods). 

One approach is to fix the return of the portfolio to be some given value $\rho$, and 
minimize the risk over all portfolios that achieve the required return. Doing this 
for many values of $\rho$ produces (different) portfolio allocation vectors that trade off 
risk and return. Requiring that the portfolio return be $\rho$ can be expressed as 

\[
\mathrm{avg}(r) = (1/T)\mathbf{1}^T (Rw) = \mu^T w = \rho,
\]

where $\mu = R^T \mathbf{1}/T$ is the n-vector of the average asset returns. This is a single 
linear equation in w. Assuming that it holds, we can express the square of the risk 
as 

\[
\mathrm{std}(r)^2 = (1/T)\|r - \mathrm{avg}(r)\mathbf{1}\|^2 = (1/T)\|r - \rho\mathbf{1}\|^2.
\]

Thus to minimize risk (squared), with return value $\rho$, we must solve the linearly 
constrained least squares problem 

\[
\begin{array}{ll}
\mbox{minimize} & \|Rw - \rho\mathbf{1}\|^2 \\
\mbox{subject to} & \begin{bmatrix} \mathbf{1}^T \\ \mu^T \end{bmatrix} w = \begin{bmatrix} 1 \\ \rho \end{bmatrix}.
\end{array} \tag{17.2}
\]

(We dropped the factor 1/T from the objective, which does not affect the solution.) 
This is a constrained least squares problem with two linear equality constraints. 
The first constraint sets the sum of the allocation weights to one, and the second 
requires that the mean portfolio return is $\rho$. 

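The portfolio optimization problem has the solution 

\[
\begin{bmatrix} w \\ z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix} 2R^TR & \mathbf{1} & \mu \\ \mathbf{1}^T & 0 & 0 \\ \mu^T & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} 2\rho T \mu \\ 1 \\ \rho \end{bmatrix}, \tag{17.3}
\]

where $z_1$ and $z_2$ are Lagrange multipliers for the equality constraints (which we 
don't care about). 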

As a historical note, the portfolio optimization problem (17.2) is not exactly 
the same as the one proposed by Markowitz. His formulation used a statistical 
model of returns, where instead we are using a set of actual (or realized) returns. 
(See exercise 17.2 for a formulation of the problem that is closer to the original 
formulation by Markowitz.) 
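
To make the solution formula (17.3) concrete, here is a minimal sketch that forms the KKT matrix and solves for the weights. The return matrix is synthetic random data, and the function name `min_risk_weights` is hypothetical; a dense solve is used for brevity, which is reasonable only for small n.

```python
import numpy as np

def min_risk_weights(R, rho):
    """Solve (17.2) through the KKT system (17.3); returns the n-vector w."""
    T, n = R.shape
    mu = R.T @ np.ones(T) / T                  # average asset returns
    KKT = np.zeros((n + 2, n + 2))
    KKT[:n, :n] = 2 * R.T @ R
    KKT[:n, n] = 1.0                           # column of ones
    KKT[:n, n + 1] = mu
    KKT[n, :n] = 1.0                           # row 1^T
    KKT[n + 1, :n] = mu                        # row mu^T
    rhs = np.concatenate([2 * rho * T * mu, [1.0, rho]])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n]                             # discard the two multipliers

rng = np.random.default_rng(1)
T, n = 500, 4
R = rng.normal(0.0005, 0.01, size=(T, n))      # synthetic daily returns (assumption)
rho = 0.001                                    # required mean return per period
w = min_risk_weights(R, rho)
r = R @ w                                      # portfolio return time series

print("weights:", w)
print("sum of weights:", w.sum(), " mean return:", r.mean(), " risk:", r.std())
```

The printed sum of weights and mean return should match 1 and rho up to rounding, since both are imposed as equality constraints.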
Future returns and the big assumption. The portfolio optimization problem (17.2) 
suffers from what would appear to be a serious conceptual flaw: It requires us to 
know the asset returns over the periods t = 1,...,T, in order to compute the op- 
timal allocation to use over those periods. This is silly: If we knew any future 
returns, we would be able to achieve as large a portfolio return as we like, by sim- 
ply putting large positive weights on the assets with positive returns and negative 
weights on those with negative returns. The whole challenge in investing is that 
we do not know future returns. 


Assume the current time is period T, so we know the (so-called realized) return 
matrix R. The portfolio weight w found by solving (17.2), based on the observed 
returns in periods t = 1,...,T, can still be useful, when we make one (big) 
assumption: 

    Future asset returns are similar to past returns.    (17.4) 

In other words, if the asset returns for future periods T + 1, T + 2, ... are similar 
in nature to the past periods t = 1,...,T, then the portfolio allocation w found by 
solving (17.2) could be a wise choice to use in future periods. 


Every time you invest, you are warned that the assumption (17.4) need not 
hold; you are required to acknowledge that past performance is no guarantee of 
future performance. The assumption (17.4) often holds well enough to be useful, 
but in times of ‘market shift’ it need not. 


This situation is similar to that encountered when fitting models to observed 
data, as in chapters 13 and 14. The model is trained on past data that you have 
observed; but it will be used to make predictions on future data that you have not 
yet seen. A model is useful only to the extent that future data looks like past data. 
And this is an assumption which often (but not always) holds reasonably well. 


Just as in model fitting, investment allocation vectors can (and should) be 
validated before being used. For example, we determine the weight vector by 
solving (17.2) using past returns data over some past training period, and check 
the performance on some other past testing period. If the portfolio performance 
over the training and testing periods are reasonably consistent, we gain confidence 
(but no guarantee) that the weight vector will work in future periods. For example, 
we might determine the weights using the realized returns from two years ago, and 
then test these weights by the performance of the portfolio over last year. If the test 
works out, we use the weights for next year. In portfolio optimization, validation 
is sometimes called back-testing, since you are testing the investment method on 
previous realized returns, to get an idea of how the method will work on (unknown) 
future returns. 


The basic assumption (17.4) often holds less well than the analogous assumption 
in data fitting, i.e., that future data looks like past data. For this reason we expect 
less coherence between the training and test performance of a portfolio, compared 
to a generic data fitting application. This is especially so when the test period has 
a small number of periods in it, like 100; see the discussion on page 268. 
17.1.3 Example 


[Figure 17.1: risk-return plot, with risk on the horizontal axis.] 

Figure 17.1 The open circles show annualized risk and return for 20 assets 
(19 stocks and one risk-free asset with a return of 1%). The solid line shows 
risk and return for the Pareto optimal portfolios. The dots show risk and 
return for three Pareto optimal portfolios with 10%, 20%, and 40% return, 
and the portfolio with weights $w_i = 1/n$. 


We use daily return data for 19 stocks over a period of 2000 days (8 years). After 
adding a risk-free asset with a 1% annual return, we obtain a 2000 x 20 return 
matrix R. The circles in figure 17.1 show the annualized risk and return for the 20 
assets, i.e., the points 


\[
\left( \sqrt{250}\,\mathrm{std}(Re_j),\; 250\,\mathrm{avg}(Re_j) \right).
\]


It also shows the Pareto-optimal risk-return curve, and the risk and return for 
the uniform portfolio with equal weights w; = 1/n. The annualized risk, return, 
and leverage for five portfolios (the four Pareto-optimal portfolios indicated in the 
figure, and the 1/n portfolio) are given in table 17.1. 


Figure 17.2 shows the total portfolio value (17.1) for the five portfolios. Fig- 
ure 17.3 shows the portfolio values for a different test period of 500 days (two 
years). 


                 Return            Risk 
Portfolio     Train   Test     Train   Test     Leverage 
Risk-free     0.01    0.01     0.00    0.00     1.00 
10%           0.10    0.08     0.09    0.07     1.96 
20%           0.20    0.15     0.18    0.15     3.03 
40%           0.40    0.30     0.38    0.31     5.48 
1/n           0.10    0.21     0.23    0.13     1.00 

Table 17.1 Annualized risk, return, and leverage for five portfolios. 


[Figure 17.2: plot of total portfolio value against day (horizontal axis from 0 to 2000), one curve per portfolio.] 

Figure 17.2 Total value over time for five portfolios: the risk-free portfolio 
with 1% annual return, the Pareto optimal portfolios with 10%, 20%, and 
40% return, and the uniform portfolio. The total value is computed using 
the 2000 x 20 daily return matrix R. 

Figure 17.3 Value over time for the five portfolios in figure 17.2 over a test 
period of 500 days. 


17.1.4 Variations 


There are many variations on the basic portfolio optimization problem (17.2). We 
describe a few of them here; a few others are explored in the exercises. 


Regularization. Just as in data fitting, our formulation of portfolio optimization 
can suffer from over-fit, which means that the chosen weights perform very well on 
past (realized) returns, but poorly on new (future) returns. Over-fit can be avoided 
or reduced by adding regularization, which here means to penalize investments in 
assets other than cash. (This is analogous to regularization in model fitting, where 
we penalize the size of the model coefficients, except for the coefficient associated 
with the constant feature.) One natural way to incorporate regularization in the 
portfolio optimization problem (17.2) is to add a positive multiple $\lambda$ of the weighted 
sum of squares term 

\[
\sigma_1^2 w_1^2 + \cdots + \sigma_{n-1}^2 w_{n-1}^2
\]

to the objective in (17.2). Note that we do not penalize $w_n$, which is the weight 
associated with the risk-free asset. The constants $\sigma_j$ are the standard deviations of 
the (realized) returns, i.e., $\sigma_j = \mathrm{std}(Re_j)$. This regularization penalizes weights 
associated with risky assets more than those associated with less risky assets. A good 
choice of $\lambda$ can be found by back-testing. 
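
One way to see that the regularized problem is still a linearly constrained least squares problem is to stack the penalty under the return matrix. The identity below is a short worked step, not taken from the text; the diagonal matrix D is notation introduced here only for illustration.

```latex
\[
\|Rw - \rho\mathbf{1}\|^2
  + \lambda\left(\sigma_1^2 w_1^2 + \cdots + \sigma_{n-1}^2 w_{n-1}^2\right)
  = \left\| \begin{bmatrix} R \\ \sqrt{\lambda}\, D \end{bmatrix} w
      - \begin{bmatrix} \rho\mathbf{1} \\ 0 \end{bmatrix} \right\|^2,
\qquad
D = \mathbf{diag}(\sigma_1, \ldots, \sigma_{n-1}, 0).
\]
```

The two equality constraints of (17.2) are unchanged, so the same KKT approach applies with R replaced by the stacked matrix and the right-hand side extended by zeros.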
Time-varying weights. Markets do shift, so it is not uncommon to periodically 
update or change the allocation weights that are used. In one extreme version of 
this, a new allocation vector is used in every period. The allocation weight for any 
period is obtained by solving the portfolio optimization problem over the preceding 
M periods. (This scheme can be modified to include testing periods as well.) The 
parameter M in this method would be chosen by validation on previous realized 
returns, i.e., back-testing. 

When the allocation weights are changed over time, we can add a (regularization) 
term of the form $\kappa \|w^{\mathrm{curr}} - w\|^2$ to the objective, where $\kappa$ is a positive constant. 
Here $w^{\mathrm{curr}}$ is the currently used allocation, and w is the proposed new allocation 
vector. The additional regularization term encourages the new allocation vector 
to be near the current one. (When this is not the case, the portfolio will require 
excessive buying and selling of assets. This is called turnover, which leads to 
trading costs not included in our simple model.) The parameter $\kappa$ would be chosen by 
back-testing, taking into account an approximation of trading cost. 


Two-fund theorem 


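We can express the solution (17.3) of the portfolio optimization problem in the 
form 

\[
\begin{bmatrix} w \\ z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix} 2R^TR & \mathbf{1} & \mu \\ \mathbf{1}^T & 0 & 0 \\ \mu^T & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}
+ \rho
\begin{bmatrix} 2R^TR & \mathbf{1} & \mu \\ \mathbf{1}^T & 0 & 0 \\ \mu^T & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} 2T\mu \\ 0 \\ 1 \end{bmatrix}.
\]

Taking the first n components of this, we obtain 

\[
w = w^0 + \rho v, \tag{17.5}
\]

where $w^0$ and v are the first n components of the (n + 2)-vectors 

\[
\begin{bmatrix} 2R^TR & \mathbf{1} & \mu \\ \mathbf{1}^T & 0 & 0 \\ \mu^T & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix},
\qquad
\begin{bmatrix} 2R^TR & \mathbf{1} & \mu \\ \mathbf{1}^T & 0 & 0 \\ \mu^T & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} 2T\mu \\ 0 \\ 1 \end{bmatrix},
\]

respectively. The equation (17.5) shows that the Pareto optimal portfolios form a 
line in weight space, parametrized by the required return $\rho$. The portfolio $w^0$ is a 
point on the line, and the vector v, which satisfies $\mathbf{1}^T v = 0$, gives the direction of 
the line. This equation tells us that we do not need to solve the equation (17.3) for 
each value of $\rho$. We first compute $w^0$ and v (by factoring the matrix once and using 
two solve steps), and then form the optimal portfolio with return $\rho$ as $w^0 + \rho v$. 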

Any point on a line can be expressed as an affine combination of two different 
points on the line. So if we find two different Pareto optimal portfolios, then we 
can express a general Pareto optimal portfolio as an affine combination of them. 
In other words, all Pareto optimal portfolios are affine combinations of just two 
portfolios (indeed, any two different Pareto optimal portfolios). This is the two- 
fund theorem. (Fund is another term for portfolio.) 
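
The two-fund property is easy to check numerically. The sketch below computes w^0 and v from a single KKT matrix with two right-hand sides, and then verifies that a third Pareto optimal portfolio is an affine combination of two others. The data are synthetic, and the names `two_fund_vectors`, `rho_a`, `rho_b`, `rho_c` are assumptions made for this illustration.

```python
import numpy as np

def two_fund_vectors(R):
    """Return (w0, v, mu) such that the solution of (17.2) is w0 + rho * v."""
    T, n = R.shape
    mu = R.T @ np.ones(T) / T
    KKT = np.block([
        [2 * R.T @ R, np.ones((n, 1)), mu[:, None]],
        [np.ones((1, n)), np.zeros((1, 2))],
        [mu[None, :], np.zeros((1, 2))],
    ])
    rhs = np.zeros((n + 2, 2))
    rhs[n, 0] = 1.0                 # first right-hand side: (0, 1, 0)
    rhs[:n, 1] = 2 * T * mu         # second right-hand side: (2 T mu, 0, 1)
    rhs[n + 1, 1] = 1.0
    sol = np.linalg.solve(KKT, rhs)  # one factorization, two solves
    return sol[:n, 0], sol[:n, 1], mu

rng = np.random.default_rng(2)
R = rng.normal(0.0005, 0.01, size=(500, 5))   # synthetic returns (assumption)
w0, v, mu = two_fund_vectors(R)

rho_a, rho_b, rho_c = 0.0005, 0.0010, 0.0020
w_a, w_b, w_c = w0 + rho_a * v, w0 + rho_b * v, w0 + rho_c * v

# Express w_c as an affine combination of the two "funds" w_a and w_b.
theta = (rho_c - rho_a) / (rho_b - rho_a)
w_combo = (1 - theta) * w_a + theta * w_b
print(np.max(np.abs(w_combo - w_c)))
```

The coefficients 1 - theta and theta sum to one but need not lie between 0 and 1, which is why the combination is affine rather than convex.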
Now suppose that the last asset is risk-free. The portfolio $w = e_n$ is Pareto 
optimal, since it achieves return $\mu^{\mathrm{rf}}$ with zero risk. We then find one other Pareto 
optimal portfolio, for example, the one $w^2$ that achieves return $2\mu^{\mathrm{rf}}$, twice the 
risk-free return. (We could choose here any return other than $\mu^{\mathrm{rf}}$.) Then we can 
express the general Pareto optimal portfolio as 

\[
w = (1 - \theta) e_n + \theta w^2,
\]

where $\theta = \rho/\mu^{\mathrm{rf}} - 1$. 


17.2 Linear quadratic control 


We consider a time-varying linear dynamical system with state n-vector $x_t$ and 
input m-vector $u_t$, with dynamics equations 

\[
x_{t+1} = A_t x_t + B_t u_t, \quad t = 1, 2, \ldots. \tag{17.6}
\]

The system has an output, the p-vector $y_t$, given by 

\[
y_t = C_t x_t, \quad t = 1, 2, \ldots. \tag{17.7}
\]

Usually, m < n and p < n, i.e., there are fewer inputs and outputs than states. 

In control applications, the input $u_t$ represents quantities that we can choose 
or manipulate, like control surface deflections or engine thrust on an airplane. The 
state $x_t$, input $u_t$, and output $y_t$ typically represent deviations from some standard 
or desired operating condition, for example, the deviation of aircraft speed and 
altitude from the desired values. For this reason it is desirable to have $x_t$, $y_t$, and 
$u_t$ small. 

Linear quadratic control refers to the problem of choosing the input and state 
sequences, over a time period t = 1,...,T, so as to minimize a sum of squares 
objective, subject to the dynamics equations (17.6), the output equations (17.7), 
and additional linear equality constraints. (In 'linear quadratic', 'linear' refers to 
the linear dynamics, and 'quadratic' refers to the objective function, which is a 
sum of squares.) 

Most control problems include an initial state constraint, which has the form 
$x_1 = x^{\mathrm{init}}$, where $x^{\mathrm{init}}$ is a given initial state. Some control problems also include a 
final state constraint $x_T = x^{\mathrm{des}}$, where $x^{\mathrm{des}}$ is a given ('desired') final (also called 
terminal or target) state. 

The objective function has the form $J = J_{\mathrm{output}} + \rho J_{\mathrm{input}}$, where 

\[
J_{\mathrm{output}} = \|y_1\|^2 + \cdots + \|y_T\|^2 = \|C_1 x_1\|^2 + \cdots + \|C_T x_T\|^2,
\]

\[
J_{\mathrm{input}} = \|u_1\|^2 + \cdots + \|u_{T-1}\|^2.
\]

The positive parameter $\rho$ weights the input objective $J_{\mathrm{input}}$ relative to the output 
objective $J_{\mathrm{output}}$. 
The linear quadratic control problem (with initial and final state constraints) is 

\[
\begin{array}{ll}
\mbox{minimize} & J_{\mathrm{output}} + \rho J_{\mathrm{input}} \\
\mbox{subject to} & x_{t+1} = A_t x_t + B_t u_t, \quad t = 1, \ldots, T-1, \\
& x_1 = x^{\mathrm{init}}, \quad x_T = x^{\mathrm{des}},
\end{array} \tag{17.8}
\]

where the variables to be chosen are $x_1, \ldots, x_T$ and $u_1, \ldots, u_{T-1}$. 


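Formulation as constrained least squares problem. We can solve the linear 
quadratic control problem (17.8) by setting it up as a big linearly constrained 
least squares problem. We define the vector z of all these variables, stacked: 

\[
z = (x_1, \ldots, x_T, u_1, \ldots, u_{T-1}).
\]

The dimension of z is Tn + (T - 1)m. The control objective can be expressed as 
$\|\tilde A z - \tilde b\|^2$, where $\tilde b = 0$ and $\tilde A$ is the block matrix 

\[
\tilde A = \left[ \begin{array}{cccc|ccc}
C_1 & & & & & & \\
& C_2 & & & & & \\
& & \ddots & & & & \\
& & & C_T & & & \\ \hline
& & & & \sqrt{\rho}\,I & & \\
& & & & & \ddots & \\
& & & & & & \sqrt{\rho}\,I
\end{array} \right].
\]

In this matrix, (block) entries not shown are zero, and the identity matrices in 
the lower right corner have dimension m. (The lines in the matrix delineate the 
portions related to the states and the inputs.) The dynamics constraints, and the 
initial and final state constraints, can be expressed as $\tilde C z = \tilde d$, with 

\[
\tilde C = \left[ \begin{array}{ccccc|cccc}
A_1 & -I & & & & B_1 & & & \\
& A_2 & -I & & & & B_2 & & \\
& & \ddots & \ddots & & & & \ddots & \\
& & & A_{T-1} & -I & & & & B_{T-1} \\ \hline
I & & & & & & & & \\ \hline
& & & & I & & & &
\end{array} \right],
\qquad
\tilde d = \left[ \begin{array}{c} 0 \\ 0 \\ \vdots \\ 0 \\ \hline x^{\mathrm{init}} \\ \hline x^{\mathrm{des}} \end{array} \right],
\]

where (block) entries not shown are zero. (The vertical line separates the portions 
of the matrix associated with the states and the inputs, and the horizontal lines 
separate the dynamics equations and the initial and final state constraints.) 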

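The solution $\hat z$ of the constrained least squares problem 

\[
\begin{array}{ll}
\mbox{minimize} & \|\tilde A z - \tilde b\|^2 \\
\mbox{subject to} & \tilde C z = \tilde d
\end{array} \tag{17.9}
\]

gives us the optimal input trajectory and the associated optimal state (and output) 
trajectory. The solution $\hat z$ is a linear function of $\tilde b$ and $\tilde d$; since here $\tilde b = 0$, it is a 
linear function of $x^{\mathrm{init}}$ and $x^{\mathrm{des}}$. 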
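
The construction above can be sketched in code for the time-invariant case ($A_t = A$, $B_t = B$, $C_t = C$). The function `lq_control` below is a hypothetical helper written for illustration: it builds the two big matrices as dense arrays and solves the KKT system directly, so it does not exploit sparsity. The numerical values are those of the time-invariant example in this section.

```python
import numpy as np

def lq_control(A, B, C, x_init, x_des, T, rho):
    """Solve (17.9) for a time-invariant system; returns states x (T x n), inputs u ((T-1) x m)."""
    n, m = B.shape
    p = C.shape[0]
    nz = T * n + (T - 1) * m
    # Objective matrix: block diagonal with C (T times), then sqrt(rho) * I.
    Atil = np.zeros((T * p + (T - 1) * m, nz))
    for t in range(T):
        Atil[t * p:(t + 1) * p, t * n:(t + 1) * n] = C
    Atil[T * p:, T * n:] = np.sqrt(rho) * np.eye((T - 1) * m)
    # Constraint matrix: dynamics rows, then initial and final state rows.
    Ctil = np.zeros(((T - 1) * n + 2 * n, nz))
    for t in range(T - 1):
        rows = slice(t * n, (t + 1) * n)
        Ctil[rows, t * n:(t + 1) * n] = A
        Ctil[rows, (t + 1) * n:(t + 2) * n] = -np.eye(n)
        Ctil[rows, T * n + t * m:T * n + (t + 1) * m] = B
    Ctil[(T - 1) * n:T * n, :n] = np.eye(n)
    Ctil[T * n:, (T - 1) * n:T * n] = np.eye(n)
    dtil = np.concatenate([np.zeros((T - 1) * n), x_init, x_des])
    # KKT system for: minimize ||Atil z||^2 subject to Ctil z = dtil.
    q = Ctil.shape[0]
    KKT = np.block([[2 * Atil.T @ Atil, Ctil.T], [Ctil, np.zeros((q, q))]])
    rhs = np.concatenate([np.zeros(nz), dtil])
    z = np.linalg.solve(KKT, rhs)[:nz]
    return z[:T * n].reshape(T, n), z[T * n:].reshape(T - 1, m)

A = np.array([[0.855, 1.161, 0.667],
              [0.015, 1.073, 0.053],
              [-0.084, 0.059, 1.022]])
B = np.array([[-0.076], [-0.139], [0.342]])
C = np.array([[0.218, -3.597, -1.683]])
x_init = np.array([0.496, -0.745, 1.394])
x_des = np.zeros(3)
T, rho = 100, 1.0

x, u = lq_control(A, B, C, x_init, x_des, T, rho)
J_output = float(np.sum((x @ C.T) ** 2))
J_input = float(np.sum(u ** 2))
print(J_output, J_input)
```

Repeating the solve for several values of `rho` and recording the two objective values traces out a trade-off curve between input and output effort.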
Complexity. The large constrained least squares problem (17.9) has dimensions 

\[
\tilde n = Tn + (T-1)m, \qquad \tilde m = Tp + (T-1)m, \qquad \tilde p = (T-1)n + 2n,
\]

so using one of the standard methods described in §16.2 would require order 

\[
(\tilde p + \tilde m)\tilde n^2 \sim T^3 (m+p+n)(m+n)^2
\]

flops, where the symbol $\sim$ means we have dropped terms with smaller exponents. 
But the matrices $\tilde A$ and $\tilde C$ are very sparse, and by exploiting this sparsity (see 
page 349), the large constrained least squares problem can be solved in order 
$T(m+p+n)(m+n)^2$ flops, which grows only linearly in T. 


17.2.1 Example 


We consider the time-invariant linear dynamical system with 

\[
A = \begin{bmatrix} 0.855 & 1.161 & 0.667 \\ 0.015 & 1.073 & 0.053 \\ -0.084 & 0.059 & 1.022 \end{bmatrix},
\qquad
B = \begin{bmatrix} -0.076 \\ -0.139 \\ 0.342 \end{bmatrix},
\]

\[
C = \begin{bmatrix} 0.218 & -3.597 & -1.683 \end{bmatrix},
\]

with initial condition $x^{\mathrm{init}} = (0.496, -0.745, 1.394)$, target or desired final state 
$x^{\mathrm{des}} = 0$, and T = 100. In this example, both the input $u_t$ and the output $y_t$ have 
dimension one, i.e., are scalar. Figure 17.4 shows the output when the input is 
zero, 

\[
y_t = C A^{t-1} x^{\mathrm{init}}, \quad t = 1, \ldots, T,
\]

which is called the open-loop output. Figure 17.5 shows the optimal trade-off curve 
of the objectives $J_{\mathrm{input}}$ and $J_{\mathrm{output}}$, found by varying the parameter $\rho$, solving the 
problem (17.9), and evaluating the objectives $J_{\mathrm{input}}$ and $J_{\mathrm{output}}$. The points 
corresponding to the values $\rho = 0.05$, $\rho = 0.2$, and $\rho = 1$ are shown as circles. As always, 
increasing $\rho$ has the effect of decreasing $J_{\mathrm{input}}$, at the cost of increasing $J_{\mathrm{output}}$. 

The optimal input and output trajectories for these three values of $\rho$ are shown 
in figure 17.6. Here too we see that for larger $\rho$, the input is smaller but the output 
is larger. 


17.2.2 Variations 


There are many variations on the basic linear quadratic control problem described 
above. We describe some of them here. 

Tracking. We replace $y_t$ in $J_{\mathrm{output}}$ with $y_t - y_t^{\mathrm{des}}$, where $y_t^{\mathrm{des}}$ is a given desired 
output trajectory. In this case the objective function $J_{\mathrm{output}}$ is called the tracking 
error. Decreasing the parameter $\rho$ leads to better output tracking, at the cost 
of larger input trajectory. This variation on the linear quadratic control problem 
can be expressed as a linearly constrained least squares problem with the same 
big matrices $\tilde A$ and $\tilde C$, the same vector $\tilde d$, and a nonzero vector $\tilde b$. The desired 
trajectory $y_t^{\mathrm{des}}$ appears in the vector $\tilde b$. 
Figure 17.4 Open-loop response $C A^{t-1} x^{\mathrm{init}}$. [Plot of the output.] 

Figure 17.5 Optimal trade-off curve of the objectives $J_{\mathrm{input}}$ and $J_{\mathrm{output}}$. [Axes: $J_{\mathrm{input}}$ and $J_{\mathrm{output}}$.] 

Figure 17.6 Optimal inputs (left) and outputs (right) for $\rho = 0.05$ (top), 
$\rho = 0.2$ (center), and $\rho = 1$ (bottom). 

Time-weighted objective. We replace $J_{\mathrm{output}}$ with 

\[
J_{\mathrm{output}} = w_1 \|y_1\|^2 + \cdots + w_T \|y_T\|^2,
\]

where $w_1, \ldots, w_T$ are given positive constants. This allows us to weight earlier or 
later output values differently. A common choice, called exponential weighting, is 
$w_t = \theta^t$, where $\theta > 0$. For $\theta > 1$ we weight later values of $y_t$ more than earlier 
values; the opposite is true for $\theta < 1$ (in which case $\theta$ is sometimes called the 
discount or forgetting factor). 


Way-point constraints. A way-point constraint specifies that $y_\tau = y^{\mathrm{wp}}$, where 
$y^{\mathrm{wp}}$ is a given p-vector, and $\tau$ is a given way-point time. This constraint is typically 
used when $y_t$ represents a position of a vehicle; it requires that the vehicle pass 
through the position $y^{\mathrm{wp}}$ at time $t = \tau$. Way-point constraints can be expressed as 
linear equality constraints on the big vector z. 


Linear state feedback control 


In the linear quadratic control problem we work out a sequence of inputs $u_1, \ldots, u_{T-1}$ 
to apply to the system, by solving the constrained least squares problem (17.8). It 
is typically used in cases where t = T has some significance, like the time of landing 
or docking for a vehicle. 

We have already mentioned (on page 185) another simpler approach to the 
control of a linear dynamical system. In linear state feedback control we measure 
the state in each period and use the input 


Ut = Kr 


for t = 1,2,.... The matrix K is called the state feedback gain matrix. State 
feedback control is very widely used in practical applications, especially ones where 
there is no fixed future time T when the state must take on some desired value; 
instead, it is desired that both $x_t$ and $u_t$ should be small and converge to zero. One
practical advantage of linear state feedback control is that we can find the state 
feedback matrix K ahead of time; when the system is operating, we determine 
the input values using one simple matrix-vector multiply. Here we show how an 
appropriate state feedback gain matrix K can be found using linear quadratic 
control. 
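
The run-time simplicity of state feedback can be seen in a short simulation sketch. The system matrices and the gain K below are hypothetical values chosen for illustration (they are not from the text), and the dynamics are assumed to be time-invariant, $x_{t+1} = A x_t + B u_t$.

```python
import numpy as np

def simulate_state_feedback(A, B, K, x1, T):
    """Simulate x_{t+1} = A x_t + B u_t with u_t = K x_t for t = 1, ..., T-1."""
    n, m = B.shape
    X = np.zeros((T, n))
    U = np.zeros((T - 1, m))
    X[0] = x1
    for t in range(T - 1):
        U[t] = K @ X[t]                  # one matrix-vector multiply per period
        X[t + 1] = A @ X[t] + B @ U[t]
    return X, U

# Hypothetical system and gain, for illustration only.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.25, -1.0]])
X, U = simulate_state_feedback(A, B, K, x1=np.array([1.0, 0.0]), T=60)
```

With this particular gain both the state and the input converge to zero, which is the behavior state feedback is meant to produce.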

Let $\hat z$ denote the solution of the linear quadratic control problem, i.e., the
solution of the linearly constrained least squares problem (17.8), with $x^{\mathrm{des}} = 0$.
The solution $\hat z$ is a linear function of $x^{\mathrm{init}}$ and $x^{\mathrm{des}}$; since here $x^{\mathrm{des}} = 0$, $\hat z$ is a linear
function of $x^{\mathrm{init}} = x_1$. Since $\hat u_1$, the optimal input at t = 1, is a slice or subvector of
$\hat z$, we conclude that $\hat u_1$ is a linear function of $x_1$, and so can be written as $u_1 = K x_1$,
for some m × n matrix K. The columns of K can be found by solving (17.8) with
initial conditions $x^{\mathrm{init}} = e_1, \ldots, e_n$. This can be done efficiently by factoring the
coefficient matrix once, and then carrying out n solves.
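
The following is a minimal sketch of this computation. It assumes that (17.8) minimizes $J_{\mathrm{output}} + \rho J_{\mathrm{input}}$ subject to the dynamics, $x_1 = x^{\mathrm{init}}$, and $x_T = x^{\mathrm{des}}$, with time-invariant A, B, C and stacked variable $z = (x_1, \ldots, x_T, u_1, \ldots, u_{T-1})$. The function names and the small test system are hypothetical, and the KKT system is solved densely rather than by exploiting sparsity.

```python
import numpy as np

def lq_kkt(A, B, C, rho, T):
    """Dense KKT matrix of the stacked problem with z = (x_1..x_T, u_1..u_{T-1})."""
    n, m = B.shape
    nx, nu = T * n, (T - 1) * m
    # Hessian of the objective: 2 * blkdiag(C^T C, ..., C^T C, rho I, ..., rho I).
    H = np.zeros((nx + nu, nx + nu))
    for t in range(T):
        H[t*n:(t+1)*n, t*n:(t+1)*n] = 2 * C.T @ C
    H[nx:, nx:] = 2 * rho * np.eye(nu)
    # Constraint rows: dynamics, then x_1 = x_init, then x_T = x_des.
    Ct = np.zeros(((T - 1) * n + 2 * n, nx + nu))
    for t in range(T - 1):
        Ct[t*n:(t+1)*n, t*n:(t+1)*n] = A
        Ct[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = -np.eye(n)
        Ct[t*n:(t+1)*n, nx + t*m:nx + (t+1)*m] = B
    Ct[(T-1)*n:T*n, :n] = np.eye(n)
    Ct[T*n:, (T-1)*n:T*n] = np.eye(n)
    k = Ct.shape[0]
    return np.block([[H, Ct.T], [Ct, np.zeros((k, k))]]), nx

def feedback_gain(A, B, C, rho, T):
    """Columns of K are the optimal u_1 for x_init = e_1, ..., e_n and x_des = 0."""
    n, m = B.shape
    KKT, nx = lq_kkt(A, B, C, rho, T)
    rhs = np.zeros((KKT.shape[0], n))
    row_init = nx + (T - 1) * m + (T - 1) * n    # first row of x_1 = x_init
    rhs[row_init:row_init + n, :] = np.eye(n)
    sol = np.linalg.solve(KKT, rhs)              # one factorization, n solves
    return sol[nx:nx + m, :]

# Hypothetical system, for illustration only.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
K = feedback_gain(A, B, C, rho=1.0, T=30)
```

Passing all n right-hand sides to a single `np.linalg.solve` call mirrors the factor-once, solve-n-times idea described above.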

This matrix generally provides a good choice of state feedback gain matrix.
With this choice, the input $u_1$ with state feedback control and under linear quadratic
control are the same; for t > 1, the two inputs differ. An interesting phenomenon,
beyond the scope of this book, is that the state feedback gain matrix K found this
way does not depend very much on T, provided T is chosen large enough.

Figure 17.7 The blue curves are the solutions of (17.8) for $\rho = 1$. The
red curves are the inputs and outputs that result from the constant state
feedback $u_t = K x_t$.


Example. For the example in §17.2.1 the state feedback gain matrix for $\rho = 1$ is

$$K = \begin{bmatrix} 0.308 & -2.659 & -1.446 \end{bmatrix}.$$

In figure 17.7, we plot the trajectories with linear quadratic control (in blue) and
using the simpler linear state feedback control $u_t = K x_t$. We can see that the input
sequence found using linear quadratic control achieves $y_T = 0$ exactly; the input
sequence found by linear state feedback control makes $y_T$ small, but not zero.


Linear quadratic state estimation 


The setting is a linear dynamical system of the form 


$$x_{t+1} = A_t x_t + B_t w_t, \qquad y_t = C_t x_t + v_t, \qquad t = 1, 2, \ldots. \tag{17.10}$$

Here the n-vector $x_t$ is the state of the system, the p-vector $y_t$ is the measurement,
the m-vector $w_t$ is the input or process noise, and the p-vector $v_t$ is the measurement
noise or residual. The matrices $A_t$, $B_t$, and $C_t$ are the dynamics, input, and output
matrices, respectively.

In state estimation, we know the matrices $A_t$, $B_t$, and $C_t$ over the time period
$t = 1, \ldots, T$, as well as the measurements $y_1, \ldots, y_T$, but we do not know the
process or measurement noises. The goal is to guess or estimate the state sequence
$x_1, \ldots, x_T$. State estimation is widely used in many application areas, including all
guidance and navigation systems, such as the Global Positioning System (GPS).

Since we do not know the process or measurement noises, we cannot exactly 
deduce the state sequence. Instead we will guess or estimate the state sequence 
$x_1, \ldots, x_T$ and process noise sequence $w_1, \ldots, w_{T-1}$, subject to the requirement
that they satisfy the dynamic system model (17.10). When we guess the state
sequence, we implicitly guess that the measurement noise is $v_t = y_t - C_t x_t$. We
make one fundamental assumption: The process and measurement noises are both 
small, or at least, not too large. 

Our primary objective is the sum of squares of the norms of the measurement 
residuals, 


$$J_{\mathrm{meas}} = \|v_1\|^2 + \cdots + \|v_T\|^2 = \|C_1 x_1 - y_1\|^2 + \cdots + \|C_T x_T - y_T\|^2.$$


If this quantity is small, it means that the proposed state sequence guess is
consistent with our measurements. Note that the quantities in the squared norms above
are the same as $-v_t$.

The secondary objective is the sum of squares of the norms of the process noise, 


$$J_{\mathrm{proc}} = \|w_1\|^2 + \cdots + \|w_{T-1}\|^2.$$


Our prior assumption that the process noise is small corresponds to this objective 
being small. 


Least squares state estimation. We will make our guesses of $x_1, \ldots, x_T$ and
$w_1, \ldots, w_{T-1}$ so as to minimize a weighted sum of our objectives, subject to the
dynamics constraints:

$$\begin{array}{ll} \text{minimize} & J_{\mathrm{meas}} + \lambda J_{\mathrm{proc}} \\ \text{subject to} & x_{t+1} = A_t x_t + B_t w_t, \quad t = 1, \ldots, T-1, \end{array} \tag{17.11}$$

where $\lambda$ is a positive parameter that allows us to put more emphasis on making
our measurement discrepancies small (by choosing $\lambda$ small), or the process noises
small (by choosing $\lambda$ large). Roughly speaking, small $\lambda$ means that we trust the
measurements more, while large $\lambda$ means that we trust the measurements less, and
put more weight on choosing a trajectory consistent with the dynamics, with small
process noise. We will see later how $\lambda$ can be chosen using validation.


Estimation versus control. The least squares state estimation problem is very 
similar to the linear quadratic control problem, but the interpretation is quite 
different. In the control problem, we can choose the inputs; they are under our 
control. Once we choose the inputs, we know the state sequence. The inputs 
are typically actions that we take to affect the state trajectory. In the estimation 
problem, the inputs (called process noise in the estimation problem) are unknown, 
and the problem is to guess them. Our job is to guess the state sequence, which 
we do not know. This is a passive task. We are not choosing inputs to affect 
the state; rather, we are observing the outputs and hoping to deduce the state 
sequence. The mathematical formulations of the two problems, however, are very 
closely related. The close connection between the two problems is sometimes called 
control/estimation duality. 
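Formulation as constrained least squares problem. The least squares state
estimation problem (17.11) can be formulated as a linearly constrained least squares
problem, using stacking. We define the stacked vector

$$z = (x_1, \ldots, x_T, w_1, \ldots, w_{T-1}).$$

The objective in (17.11) can be expressed as $\|\tilde A z - \tilde b\|^2$, with

$$\tilde A = \begin{bmatrix}
C_1 & & & & & & \\
 & C_2 & & & & & \\
 & & \ddots & & & & \\
 & & & C_T & & & \\
 & & & & \sqrt{\lambda} I & & \\
 & & & & & \ddots & \\
 & & & & & & \sqrt{\lambda} I
\end{bmatrix}, \qquad
\tilde b = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

The constraints in (17.11) can be expressed as $\tilde C z = \tilde d$, with $\tilde d = 0$ and

$$\tilde C = \begin{bmatrix}
A_1 & -I & & & & B_1 & & & \\
 & A_2 & -I & & & & B_2 & & \\
 & & \ddots & \ddots & & & & \ddots & \\
 & & & A_{T-1} & -I & & & & B_{T-1}
\end{bmatrix}.$$

The constrained least squares problem has dimensions

$$\tilde n = Tn + (T-1)m, \qquad \tilde m = Tp + (T-1)m, \qquad \tilde p = (T-1)n,$$

so using one of the standard methods described in §16.2 would require order

$$(\tilde p + \tilde m)\tilde n^2 \approx T^3 (m + p + n)(m + n)^2$$

flops. As in the case of linear quadratic control, the matrices $\tilde A$ and $\tilde C$ are very
sparse, and by exploiting this sparsity (see page 349), the large constrained least
squares problem can be solved in order $T(m + p + n)(m + n)^2$ flops, which grows
only linearly in T.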

The least squares state estimation problem was formulated in around 1960 by 
Rudolf Kalman and others (in a statistical framework). He and others developed a 
particular recursive algorithm for solving the problem, and the whole method has 
come to be known as Kalman filtering. For this work Kalman was awarded the 
Kyoto Prize in 1985. 

17.3.1 Example 


We consider a system with n = 4, p = 2, and m = 2, and time-invariant matrices 


$$A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$

This is a very simple model of motion of a mass moving in 2-D. The first two
components of $x_t$ represent the position coordinates; components 3 and 4 represent
the velocity coordinates. The input $w_t$ acts like a force on the mass, since it adds to
the velocity. We think of the 2-vector $C x_t$ as the exact or true position of the mass
at period t. The measurement $y_t = C x_t + v_t$ is a noisy measurement of the mass
position. We will estimate the state trajectory over $t = 1, \ldots, T$, with T = 100.

In figure 17.8 the 100 measured positions $y_t$ are shown as circles in 2-D. The
solid black lines show $C x_t$, i.e., the actual position of the mass. We solve the least
squares state estimation problem (17.11) for a range of values of $\lambda$. The estimated
trajectories $C \hat x_t$ for three values of $\lambda$ are shown as blue lines. We can see that $\lambda = 1$
is too small for this example: The estimated state places too much trust in the
measurements, and is following measurement noise. We can also see that $\lambda = 10^5$
is too large: The estimated state is very smooth (since the estimated process noise
is small), but the imputed noise measurements are too high. In this example the
choice of $\lambda$ is simple, since we have the true position trajectory. We will see later
how $\lambda$ can be chosen using validation in the general case.
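
An estimate of this kind could be computed along the following lines. This is a minimal sketch, not the method used to produce the figure: it stacks (17.11) into constrained least squares form and solves the KKT equations densely, without exploiting sparsity. The function name, the simulated data, and the noise levels are assumptions made for illustration.

```python
import numpy as np

def ls_state_estimate(A, B, C, Y, lam):
    """Solve (17.11) for time-invariant A, B, C. Row t of Y is y_t.
    Returns (X_hat, W_hat) with shapes T x n and (T-1) x m."""
    T, p = Y.shape
    n, m = B.shape
    nx, nw = T * n, (T - 1) * m
    At = np.zeros((T * p + nw, nx + nw))          # stacked objective matrix
    bt = np.zeros(T * p + nw)                     # stacked objective vector
    for t in range(T):
        At[t*p:(t+1)*p, t*n:(t+1)*n] = C
        bt[t*p:(t+1)*p] = Y[t]
    At[T*p:, nx:] = np.sqrt(lam) * np.eye(nw)
    Ct = np.zeros(((T - 1) * n, nx + nw))         # dynamics constraints, rhs 0
    for t in range(T - 1):
        Ct[t*n:(t+1)*n, t*n:(t+1)*n] = A
        Ct[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = -np.eye(n)
        Ct[t*n:(t+1)*n, nx + t*m:nx + (t+1)*m] = B
    k = Ct.shape[0]
    KKT = np.block([[2 * At.T @ At, Ct.T], [Ct, np.zeros((k, k))]])
    rhs = np.concatenate([2 * At.T @ bt, np.zeros(k)])
    z = np.linalg.solve(KKT, rhs)[:nx + nw]
    return z[:nx].reshape(T, n), z[nx:].reshape(T - 1, m)

# Point-mass model from the example; the data are simulated with assumed noise levels.
A = np.array([[1.0, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]])
B = np.array([[0.0, 0], [0, 0], [1, 0], [0, 1]])
C = np.array([[1.0, 0, 0, 0], [0, 1, 0, 0]])
rng = np.random.default_rng(1)
T = 100
X = np.zeros((T, 4))
for t in range(T - 1):
    X[t + 1] = A @ X[t] + B @ (0.1 * rng.standard_normal(2))
Y = X @ C.T + 5.0 * rng.standard_normal((T, 2))
X_hat, W_hat = ls_state_estimate(A, B, C, Y, lam=1e3)
```

The dense solve is adequate for T = 100; for long horizons a sparse solver would be the natural replacement.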


Variations 


Known initial state. There are several interesting variations on the state
estimation problem. For example, we might know the initial state $x_1$. In this case we
simply add an equality constraint $x_1 = x_1^{\mathrm{known}}$.


Missing measurements. Another useful variation on the least squares state
estimation problem allows for missing measurements, i.e., we only know $y_t$ for $t \in \mathcal{T}$,
where $\mathcal{T}$ is the set of times for which we have a measurement. We can
handle this variation two (equivalent) ways: We can either replace $\sum_{t=1}^{T} \|v_t\|^2$ with
$\sum_{t \in \mathcal{T}} \|v_t\|^2$, or we can consider $y_t$ for $t \notin \mathcal{T}$ to be optimization variables as well.
(Both lead to the same state sequence estimate.) When there are missing
measurements, we can estimate what the missing measurements might have been, by
taking

$$\hat y_t = C_t \hat x_t, \qquad t \notin \mathcal{T}.$$

(Here we assume that $v_t = 0$.)


Validation 


The technique of estimating what a missing measurement might have been directly
gives us a method to validate a quadratic state estimation method, and in
particular, to choose $\lambda$. To do this, we remove some of the measurements (say, 20%),
and carry out least squares state estimation pretending that those measurements
are missing. Our state estimate produces predicted values for the missing (really,
held back) measurements, which we can compare to the actual measurements. We
choose a value of $\lambda$ which approximately minimizes this (test) prediction error.

Figure 17.8 The circles show 100 noisy measurements in 2-D. The solid
black line in each plot is the exact position $C x_t$. The blue lines in plots 2 to 4
are estimated trajectories $C \hat x_t$ for three values of $\lambda$.

Figure 17.9 Training and test errors for the state estimation example.
(Vertical axis: RMS error; horizontal axis: $\lambda$, on a logarithmic scale.)


Example. Continuing the previous example, we randomly remove 20 of the 100
measurement points. We solve the same problem (17.11) for a range of values of
$\lambda$, but with $J_{\mathrm{meas}}$ defined as

$$J_{\mathrm{meas}} = \sum_{t \in \mathcal{T}} \|C x_t - y_t\|^2,$$

i.e., we only sum the measurement errors over the measurements we have. For each
value of $\lambda$ we compute the RMS train and test errors

$$E_{\mathrm{train}} = \frac{1}{\sqrt{80p}} \Big( \sum_{t \in \mathcal{T}} \|C \hat x_t - y_t\|^2 \Big)^{1/2}, \qquad
E_{\mathrm{test}} = \frac{1}{\sqrt{20p}} \Big( \sum_{t \notin \mathcal{T}} \|C \hat x_t - y_t\|^2 \Big)^{1/2}.$$

The training error (squared and scaled) appears directly in our minimization
problem. The test error, however, is a good test of our estimation method, since it
compares predictions of positions (in this example) with measurements of position
that were not used to form the estimates. The errors are shown in figure 17.9, as
functions of the parameter $\lambda$. We can clearly see that for $\lambda < 100$ or so, we are
over-fit, since the test RMS error substantially exceeds the train RMS error. We
can also see that $\lambda$ around $10^3$ is a good choice.
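
The validation procedure can be sketched as follows. The data are simulated with assumed noise levels, so the numbers will not match figure 17.9. The function names are hypothetical, and the KKT system is solved densely for simplicity.

```python
import numpy as np

def estimate(A, B, C, Y, keep, lam):
    """Solve (17.11) using only the measurements y_t with keep[t] True."""
    T, p = Y.shape
    n, m = B.shape
    nx, nw = T * n, (T - 1) * m
    used = np.flatnonzero(keep)
    At = np.zeros((len(used) * p + nw, nx + nw))
    bt = np.zeros(At.shape[0])
    for i, t in enumerate(used):                  # measurement rows, kept periods only
        At[i*p:(i+1)*p, t*n:(t+1)*n] = C
        bt[i*p:(i+1)*p] = Y[t]
    At[len(used)*p:, nx:] = np.sqrt(lam) * np.eye(nw)
    Ct = np.zeros(((T - 1) * n, nx + nw))         # dynamics constraints
    for t in range(T - 1):
        Ct[t*n:(t+1)*n, t*n:(t+1)*n] = A
        Ct[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = -np.eye(n)
        Ct[t*n:(t+1)*n, nx + t*m:nx + (t+1)*m] = B
    k = Ct.shape[0]
    KKT = np.block([[2 * At.T @ At, Ct.T], [Ct, np.zeros((k, k))]])
    rhs = np.concatenate([2 * At.T @ bt, np.zeros(k)])
    return np.linalg.solve(KKT, rhs)[:nx].reshape(T, n)

def rms(C, X_hat, Y, idx):
    """RMS of C x_hat_t - y_t over the periods in idx (normalized by len(idx) * p)."""
    R = X_hat[idx] @ C.T - Y[idx]
    return float(np.sqrt(np.mean(R ** 2)))

# Simulated data for the point-mass model; noise levels are assumptions.
A = np.array([[1.0, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]])
B = np.array([[0.0, 0], [0, 0], [1, 0], [0, 1]])
C = np.array([[1.0, 0, 0, 0], [0, 1, 0, 0]])
rng = np.random.default_rng(0)
T = 100
X = np.zeros((T, 4))
for t in range(T - 1):
    X[t + 1] = A @ X[t] + B @ (0.1 * rng.standard_normal(2))
Y = X @ C.T + 5.0 * rng.standard_normal((T, 2))

keep = np.ones(T, dtype=bool)
keep[rng.choice(T, size=20, replace=False)] = False   # hold back 20 measurements
lams = [1e-1, 1e1, 1e3, 1e5]
train, test = [], []
for lam in lams:
    X_hat = estimate(A, B, C, Y, keep, lam)
    train.append(rms(C, X_hat, Y, np.flatnonzero(keep)))
    test.append(rms(C, X_hat, Y, np.flatnonzero(~keep)))
best_lam = lams[int(np.argmin(test))]
```

The training error can only grow as $\lambda$ increases, which is why the held-back measurements, and not the training error, are used to pick $\lambda$.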
Exercises


17.1 A variation on the portfolio optimization formulation. Consider the following variation
on the linearly constrained least squares problem (17.2):

$$\begin{array}{ll} \text{minimize} & \|Rw\|^2 \\ \text{subject to} & \begin{bmatrix} \mathbf{1}^T \\ \mu^T \end{bmatrix} w = \begin{bmatrix} 1 \\ \rho \end{bmatrix}, \end{array} \tag{17.12}$$

with variable w. (The difference is that here we drop the term $\rho \mathbf{1}$ that appears inside the
norm square objective in (17.2).) Show that this problem is equivalent to (17.2). This
means w is a solution of (17.12) if and only if it is a solution of (17.2).

Hint. You can argue directly by expanding the objective in (17.2) or via the KKT systems 
of the two problems. 


17.2 A more conventional formulation of the portfolio optimization problem. In this problem
we derive an equivalent formulation of the portfolio optimization problem (17.2) that
appears more frequently in the literature than our version. (Equivalent means that the
two problems always have the same solution.) This formulation is based on the return
covariance matrix, which we define below. (See also exercise 10.16.)

The means of the columns of the asset return matrix R are the entries of the vector $\mu$.
The de-meaned returns matrix is given by $\tilde R = R - \mathbf{1}\mu^T$. (The columns of the matrix
$\tilde R = R - \mathbf{1}\mu^T$ are the de-meaned return time series for the assets.) The return covariance
matrix, traditionally denoted $\Sigma$, is its Gram matrix $\Sigma = (1/T) \tilde R^T \tilde R$.


(a) Show that $\sigma_i = \sqrt{\Sigma_{ii}}$ is the standard deviation (risk) of asset i return. (The symbol
$\sigma_i$ is a traditional one for the standard deviation of asset i.)


(b) Show that the correlation coefficient between asset i and asset j returns is given by
$\rho_{ij} = \Sigma_{ij}/(\sigma_i \sigma_j)$. (Assuming neither asset has constant return; if either one does,
they are uncorrelated.)


(c) Portfolio optimization using the return covariance matrix. Show that the following
problem is equivalent to our portfolio optimization problem (17.2):

$$\begin{array}{ll} \text{minimize} & w^T \Sigma w \\ \text{subject to} & \begin{bmatrix} \mathbf{1}^T \\ \mu^T \end{bmatrix} w = \begin{bmatrix} 1 \\ \rho \end{bmatrix}, \end{array} \tag{17.13}$$

with variable w. This is the form of the portfolio optimization problem that you
will find in the literature. Hint. Show that the objective is the same as $\|\tilde R w\|^2$
(up to the constant factor 1/T), and that this is the same as $\|Rw - \rho \mathbf{1}\|^2$ for any
feasible w.


17.3 A simple portfolio optimization problem.


(a) Find an analytical solution for the portfolio optimization problem with n = 2 assets.
You can assume that $\mu_1 \neq \mu_2$, i.e., the two assets have different mean returns. Hint.
The optimal weights depend only on $\mu$ and $\rho$, and not (directly) on the return
matrix R.


(b) Find the conditions under which the optimal portfolio takes long positions in both
assets, a short position in one and a long position in the other, or a short position
in both assets. You can assume that $\mu_1 < \mu_2$, i.e., asset 2 has the higher return.
Hint. Your answer should depend on whether $\rho < \mu_1$, $\mu_1 < \rho < \mu_2$, or $\mu_2 < \rho$, i.e.,
how the required return compares to the two asset returns.


17.4 Index tracking. Index tracking is a variation on the portfolio optimization problem
described in §17.1. As in that problem we choose a portfolio allocation weight vector w that
satisfies $\mathbf{1}^T w = 1$. This weight vector gives a portfolio return time series Rw, which is
a T-vector. In index tracking, the goal is for this return time series to track (or follow)
as closely as possible a given target return time series $r^{\mathrm{tar}}$. We choose w to minimize
the RMS deviation between the target return time series $r^{\mathrm{tar}}$ and the portfolio return
time series r. (Typically the target return is the return of an index, like the Dow Jones
Industrial Average or the Russell 3000.) Formulate the index tracking problem as a
linearly constrained least squares problem, analogous to (17.2). Give an explicit solution,
analogous to (17.3).


17.5 Portfolio optimization with market neutral constraint. In the portfolio optimization
problem (17.2) the portfolio return time series is the T-vector Rw. Let $r^{\mathrm{mkt}}$ denote the
T-vector that gives the return of the whole market over the time periods $t = 1, \ldots, T$.
(This is the return associated with the total value of the market, i.e., the sum over the
assets of asset share price times number of outstanding shares.) A portfolio is said to be
market neutral if Rw and $r^{\mathrm{mkt}}$ are uncorrelated.


Explain how to formulate the portfolio optimization problem, with the additional con- 
straint of market neutrality, as a constrained least squares problem. Give an explicit 
solution, analogous to (17.3). 


17.6 State feedback control of the longitudinal motions of a Boeing 747 aircraft. In this exercise
we consider the control of the longitudinal motions of a Boeing 747 aircraft in steady level
flight, at an altitude of 40000 ft, and speed 774 ft/s, which is 528 MPH or 460 knots,
around Mach 0.8 at that altitude. (Longitudinal means that we consider climb rate and
speed, but not turning or rolling motions.) For modest deviations from these steady
state or trim conditions, the dynamics is given by the linear dynamical system
$x_{t+1} = A x_t + B u_t$, with

$$A = \begin{bmatrix} 0.99 & 0.03 & -0.02 & -0.32 \\ 0.01 & 0.47 & 4.70 & 0.00 \\ 0.02 & -0.06 & 0.40 & 0.00 \\ 0.01 & -0.04 & 0.72 & 0.99 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0.01 & 0.99 \\ -3.44 & 1.66 \\ -0.83 & 0.44 \\ -0.47 & 0.25 \end{bmatrix},$$

with time unit one second. The state 4-vector $x_t$ consists of deviations from the trim
conditions of the following quantities.

• $(x_t)_1$ is the velocity along the airplane body axis, in ft/s, with forward motion
positive.

• $(x_t)_2$ is the velocity perpendicular to the body axis, in ft/s, with positive down.

• $(x_t)_3$ is the angle of the body axis above horizontal, in units of 0.01 radian (0.57°).

• $(x_t)_4$ is the derivative of the angle of the body axis, called the pitch rate, in units of
0.01 radian/s (0.57°/s).

The input 2-vector $u_t$ (which we can control) consists of deviations from the trim
conditions of the following quantities.

• $(u_t)_1$ is the elevator (control surface) angle, in units of 0.01 radian.

• $(u_t)_2$ is the engine thrust, in units of 10000 lbs.

You do not need to know these details; we mention them only so you know what the
entries of $x_t$ and $u_t$ mean.


(a) Open loop trajectory. Simulate the motion of the Boeing 747 with initial condition
$x_1 = e_4$, in open-loop (i.e., with $u_t = 0$). Plot the state variables over the time
interval $t = 1, \ldots, 120$ (two minutes). The oscillation you will see in the open-loop
simulation is well known to pilots, and called the phugoid mode.


(b) Linear quadratic control. Solve the linear quadratic control problem with C = I,
$\rho = 100$, and T = 100, with initial state $x_1 = e_4$, and desired terminal state $x^{\mathrm{des}} = 0$.
Plot the state and input variables over $t = 1, \ldots, 120$. (For $t = 100, \ldots, 120$, the
state and input variables are zero.)


(c) Find the 2 × 4 state feedback gain K obtained by solving the linear quadratic control
problem with C = I, $\rho = 100$, T = 100, as described in §17.2.3. Verify that it is
almost the same as the one obtained with T = 50.


(d) Simulate the motion of the Boeing 747 with initial condition $x_1 = e_4$, under state
feedback control (i.e., with $u_t = K x_t$). Plot the state and input variables over the
time interval $t = 1, \ldots, 120$.


17.7 Bio-mass estimation. A bio-reactor is used to grow three different bacteria. We let $x_t$
be the 3-vector of the bio-masses of the three bacteria, at time period (say, hour) t, for
$t = 1, \ldots, T$. We believe that they each grow, independently, with growth rates given by
the 3-vector r (which has positive entries). This means that $(x_{t+1})_i \approx (1 + r_i)(x_t)_i$, for
i = 1, 2, 3. (These equations are approximate; the real rate is not constant.) At every
time sample we measure the total bio-mass in the reactor, i.e., we have measurements
$y_t \approx \mathbf{1}^T x_t$, for $t = 1, \ldots, T$. (The measurements are not exactly equal to the total mass;
there are small measurement errors.) We do not know the bio-masses $x_1, \ldots, x_T$, but wish
to estimate them based on the measurements $y_1, \ldots, y_T$.


Set this up as a linear quadratic state estimation problem as in §17.3. Identify the
matrices $A_t$, $B_t$, and $C_t$. Explain what effect the parameter $\lambda$ has on the estimated
bio-mass trajectory $\hat x_1, \ldots, \hat x_T$.
Chapter 18


Nonlinear least squares 


In previous chapters we studied the problems of solving a set of linear equations 
or finding a least squares approximate solution to them. In this chapter we study 
extensions of these problems in which linear is replaced with nonlinear. These 
nonlinear problems are in general hard to solve exactly, but we describe a heuristic 
algorithm that often works well in practice. 


Nonlinear equations and least squares 


Nonlinear equations 


Consider a set of m possibly nonlinear equations in n unknowns (or variables)
$x = (x_1, \ldots, x_n)$, written as

$$f_i(x) = 0, \quad i = 1, \ldots, m,$$

where $f_i : \mathbf{R}^n \to \mathbf{R}$ is a scalar-valued function. We refer to $f_i(x) = 0$ as the ith
equation. For any x we call $f_i(x)$ the ith residual, since it is a quantity we want to
be zero. Many interesting practical problems can be expressed as the problem of
solving, possibly approximately, a set of nonlinear equations.

We take the right-hand side of the equations to be zero to simplify the problem 
notation. If we need to solve $f_i(x) = b_i$, $i = 1, \ldots, m$, where $b_i$ are some given
nonzero numbers, we define $\tilde f_i(x) = f_i(x) - b_i$, and solve $\tilde f_i(x) = 0$, $i = 1, \ldots, m$,
which gives us a solution of the original equations. Assuming the right-hand sides 
of the equations are zero will simplify formulas and equations. 

We often write the set of equations in the compact vector form 


f(x) =0, (18.1) 


where $f(x) = (f_1(x), \ldots, f_m(x))$ is an m-vector, and the zero vector on the
right-hand side has dimension m. We can think of f as a function that maps n-vectors
to m-vectors, i.e., $f : \mathbf{R}^n \to \mathbf{R}^m$. We refer to the m-vector f(x) as the residual
(vector) associated with the choice of the n-vector x; our goal is to find x with
associated residual zero.

When f is an affine function, the set of equations (18.1) is a set of m linear 
equations in n unknowns, which can be solved (or approximately solved in a least 
squares sense when m > n), using the techniques covered in previous chapters. We 
are interested here in the case when f is not affine. 

We extend the ideas of under-determined, square, and over-determined equa- 
tions to the nonlinear case. When m < n, there are fewer equations than unknowns, 
and the system of equations (18.1) is called under-determined. When m = n, so 
there are as many equations as unknowns, the system of equations is called square. 
When m > n, there are more equations than unknowns, and the system of equa- 
tions is called over-determined. 


Nonlinear least squares 


When we cannot find a solution of the equations (18.1), we can seek an approximate 
solution, by finding x that minimizes the sum of squares of the residuals, 


$$f_1(x)^2 + \cdots + f_m(x)^2 = \|f(x)\|^2.$$

This means finding $\hat x$ for which $\|f(x)\|^2 \geq \|f(\hat x)\|^2$ holds for all x. We refer to
such a point as a least squares approximate solution of (18.1), or more directly, as
a solution of the nonlinear least squares problem

$$\text{minimize} \quad \|f(x)\|^2, \tag{18.2}$$


where the n-vector x is the variable to be found. When the function f is affine, the 
nonlinear least squares problem (18.2) reduces to the (linear) least squares problem 
from chapter 12. 

The nonlinear least squares problem (18.2) includes the problem of solving 
nonlinear equations (18.1) as a special case, since any x that satisfies f(x) = 0 is 
also a solution of the nonlinear least squares problem. But as in the case of linear 
equations, the least squares approximate solution of a set of nonlinear equations is 
often very useful even when it does not solve the equations. So we will focus on 
the nonlinear least squares problem (18.2). 


Optimality condition 


Calculus gives us a necessary condition for $\hat x$ to be a solution of (18.2), i.e., to
minimize $\|f(x)\|^2$. (This means that the condition must hold for a solution, but
it may also hold for other points that are not solutions.) The partial derivative of
$\|f(x)\|^2$ with respect to each of $x_1, \ldots, x_n$ must vanish at $\hat x$:

$$\frac{\partial}{\partial x_i} \|f(\hat x)\|^2 = 0, \quad i = 1, \ldots, n,$$

or, in vector form, $\nabla \|f(\hat x)\|^2 = 0$ (see §C.2). This gradient can be expressed as

$$\nabla \|f(x)\|^2 = \nabla \Big( \sum_{i=1}^m f_i(x)^2 \Big) = 2 \sum_{i=1}^m f_i(x) \nabla f_i(x) = 2 Df(x)^T f(x),$$

where the m × n matrix Df(x) is the derivative or Jacobian matrix of the function
f at the point x, i.e., the matrix of its partial derivatives (see §8.2.1 and §C.1). So
if $\hat x$ minimizes $\|f(x)\|^2$, it must satisfy

$$2 Df(\hat x)^T f(\hat x) = 0. \tag{18.3}$$

This optimality condition must hold for any solution of the nonlinear least squares
problem (18.2). But the optimality condition can also hold for other points that
are not solutions of the nonlinear least squares problem. For this reason the
optimality condition (18.3) is called a necessary condition for optimality, because it is
necessarily satisfied for any solution $\hat x$. It is not a sufficient condition for
optimality, since the optimality condition (18.3) is not enough (i.e., is not sufficient)
to guarantee that the point is a solution of the nonlinear least squares problem.

When the function f is affine, the optimality conditions (18.3) reduce to the
normal equations (12.4), the optimality conditions for the (linear) least squares
problem.
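
The gradient formula $2 Df(x)^T f(x)$ can be checked numerically. The residual function below is a hypothetical example with n = 2 and m = 3, chosen only for illustration; the check compares the formula against central finite differences of $\|f(x)\|^2$.

```python
import numpy as np

def f(x):
    """Hypothetical residual function f : R^2 -> R^3."""
    return np.array([x[0] ** 2 + x[1] - 1.0,
                     np.sin(x[0]) - x[1],
                     x[0] * x[1]])

def Df(x):
    """Jacobian of f: row i holds the partial derivatives of f_i."""
    return np.array([[2 * x[0], 1.0],
                     [np.cos(x[0]), -1.0],
                     [x[1], x[0]]])

def grad_sumsq(x):
    """Gradient of ||f(x)||^2 from the formula 2 Df(x)^T f(x)."""
    return 2 * Df(x).T @ f(x)

def numerical_grad(x, h=1e-6):
    """Central finite-difference approximation of the gradient of ||f(x)||^2."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (np.sum(f(x + e) ** 2) - np.sum(f(x - e) ** 2)) / (2 * h)
    return g

x = np.array([0.7, -0.3])
g_formula = grad_sumsq(x)
g_numeric = numerical_grad(x)
```

A check of this kind is a cheap way to catch errors in a hand-coded Jacobian before it is used inside an iterative algorithm.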


Difficulty of solving nonlinear equations 


Solving a set of nonlinear equations (18.1), or solving the nonlinear least squares 
problem (18.2), is in general much more difficult than solving a set of linear equa- 
tions or a linear least squares problem. For nonlinear equations, there can be no 
solution, or any number of solutions, or an infinite number of solutions. Unlike 
linear equations, it is a very difficult computational problem to determine which 
one of these cases holds for a particular set of equations; there is no analog of the 
QR factorization that we can use for linear equations and least squares problems. 
Even the simple sounding problem of determining whether or not there are any 
solutions to a set of nonlinear equations is very difficult computationally. There 
are advanced non-heuristic algorithms for exactly solving nonlinear equations, or 
exactly solving nonlinear least squares problems, but they are complicated and 
very computationally demanding, and rarely used in applications. 

Given the difficulty of solving a set of nonlinear equations, or solving a nonlinear 
least squares problem, we must lower our expectations. We can only hope for an 
algorithm that often finds a solution (when one exists), or produces a value of x 
with small residual norm, if not the smallest that is possible. Algorithms like this, 
that often work, or tend to produce a good if not always the best possible point, are 
called heuristics. The k-means algorithm of chapter 4 is an example of a heuristic 
algorithm. Solving linear equations or linear least squares problems using the QR 
factorization are not heuristics; these algorithms always work. 

Many heuristic algorithms for the nonlinear least squares problem, including
those we describe later in this chapter, compute a point $\hat x$ that satisfies the
optimality condition (18.3). Unless $f(\hat x) = 0$, however, such a point need not be a
solution of the nonlinear least squares problem (18.2).
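
A small worked example, not taken from the text, illustrates the gap between the optimality condition and actual optimality. Take n = m = 1 and $f(x) = x^2 - 1$, so that

```latex
\|f(x)\|^2 = (x^2 - 1)^2, \qquad 2\,Df(x)^T f(x) = 4x(x^2 - 1).
```

The condition (18.3) holds at $\hat x = -1$, $0$, and $1$. The points $\pm 1$ satisfy $f(x) = 0$ and therefore solve (18.2). The point $\hat x = 0$ also satisfies (18.3), but there $\|f(0)\|^2 = 1$, so it is not a solution; it is in fact a local maximum of the objective.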
Examples


In this section we list a few applications that reduce to solving a set of nonlinear 
equations, or a nonlinear least squares problem. 


Computing equilibrium points. The idea of an equilibrium, where some type 
of consumption and generation balance each other, arises in many applications. 
Consumption and generation depend, often nonlinearly, on the values of some pa- 
rameters, and the goal is to find values of the parameters that lead to equilibrium. 
These examples typically have m = n, i.e., the system of nonlinear equations is 
square. 


• Equilibrium prices. We consider n commodities or goods, with associated
prices given by the n-vector p. The demand for the n goods (an n-vector) is
a nonlinear function of the prices, given by D(p). (In an example on page 150
we described an approximate model for demand that is accurate when the
prices change from nominal values by a few percent; here we consider the
demand over a large range of prices.) The supply of the goods (an n-vector)
also depends on the prices, and is given by S(p). (When the price for a good
is high, for example, more producers are willing to produce it, so the supply
increases.)

A set of commodity prices p is an equilibrium price vector if it results in
supply balancing demand, i.e., S(p) = D(p). Finding a set of equilibrium
prices is the same as solving the square set of nonlinear equations

$$f(p) = S(p) - D(p) = 0.$$

(The vector f(p) is called the excess supply, at the set of prices p.) This is
shown in figure 18.1 for a simple case with n = 1.


• Chemical equilibrium. We consider n chemical species in a solution. The
n-vector c denotes the concentrations of the n species. Reactions among
the species consume some of them (the reactants) and generate others (the
products). The rate of each reaction is a function of the concentrations of
its reactants (and other parameters we assume are fixed, like temperature or
presence of catalysts). We let C(c) denote the vector of total consumption of
the n reactants, over all the reactions, and we let G(c) denote the vector of
generation of the n reactants, over all reactions.

A concentration vector c is in chemical equilibrium if C(c) = G(c), i.e., the
rate of consumption of all species balances the rate of generation. Computing
a set of equilibrium concentrations is the same as solving the square set of
nonlinear equations

$$f(c) = C(c) - G(c) = 0.$$


• Mechanical equilibrium. A mechanical system in 3-D with N nodes is
characterized by the positions of the nodes, given by a 3N-vector q of the stacked
node positions, called the generalized position. The net force on each node is
a 3-vector, which depends on q, i.e., the node positions. We describe this as
a 3N-vector of forces, F(q).

The system is in mechanical equilibrium if the net force on each node is zero,
i.e., F(q) = 0, a set of 3N nonlinear equations in 3N unknowns. (A more
complex mechanical equilibrium model takes into account angular
displacements and torques at each node.)

Figure 18.1 Supply and demand as functions of the price, shown on the
horizontal axis. They intersect at the point shown as a circle. The corresponding
price is the equilibrium price.


• Nash equilibrium. We consider a simple setup for a mathematical game. Each
of n competing agents or participants chooses a number $x_i$. Each agent is
given a (numerical) reward (say, money) that depends not only on her own
choice, but also on the choice of all the other agents. The reward for agent i
is given by the function $R_i(x)$, called the payoff function. Each agent wishes
to make a choice that maximizes her reward. This is complicated since the
reward depends not only on her choice, but the choices of the other agents.

A Nash equilibrium (named after the mathematician John Forbes Nash, Jr.)
is a set of choices given by the n-vector x where no agent can improve
(increase) her reward by changing her choice. Such a choice is argued to be
‘stable’ since no agent is incented to change her choice. At a Nash
equilibrium $x_i$ maximizes $R_i(x)$, so we must have

$$\frac{\partial R_i}{\partial x_i}(x) = 0, \quad i = 1, \ldots, n.$$

This necessary condition for a Nash equilibrium is a square set of nonlinear
equations.

The idea of a Nash equilibrium is widely used in economics, social science,
and engineering. Nash was awarded the Nobel Prize in economics for this
work in 1994.
Nonlinear least squares examples. Nonlinear least squares problems arise in
many of the same settings and applications as linear least squares problems. 


• Location from range measurements. The 3-vector (or 2-vector) x represents
the location of some object or target in 3-D (or 2-D), which we wish to
determine or guess. We are given m range measurements, i.e., the distance
from x to some known locations $a_1, \ldots, a_m$,

$$\rho_i = \|x - a_i\| + v_i, \quad i = 1, \ldots, m,$$

where $v_i$ is an unknown measurement error, assumed to be small. Our
estimate $\hat x$ of the location is found by minimizing the sum of the squares of the
range residuals,

$$\sum_{i=1}^m \left( \|x - a_i\| - \rho_i \right)^2.$$

A similar method is used in GPS devices, where $a_i$ are the known locations
of GPS satellites that are in view.


• Nonlinear model fitting. We consider a model $y \approx \hat f(x; \theta)$ where x denotes
a feature vector, y a scalar outcome, $\hat f$ a model of some function relating x
and y, and $\theta$ is a vector of model parameters that we seek. In chapter 13, $\hat f$
is an affine function of the model parameter p-vector $\theta$; but here it need not
be. As in chapter 13, we choose the model parameter by minimizing the sum
of the squares of the residuals over a data set with N examples,

$$\sum_{i=1}^{N} \left( \hat f(x^{(i)}; \theta) - y^{(i)} \right)^2. \qquad (18.4)$$

(As in linear least squares model fitting, we can add a regularization term
to this objective function.) This is a nonlinear least squares problem, with
variable $\theta$.
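To make the first example concrete, the range residual $f_i(x) = \|x - a_i\| - \rho_i$ and its derivative matrix can be sketched in a few lines of Python with NumPy. The function names `range_residual` and `range_jacobian`, the anchor locations, and the noise-free ranges are all assumptions made up for illustration; they are not taken from the text.

```python
import numpy as np

def range_residual(x, anchors, rho):
    """Residual f(x) with entries ||x - a_i|| - rho_i (row i of `anchors` is a_i)."""
    return np.linalg.norm(x - anchors, axis=1) - rho

def range_jacobian(x, anchors):
    """Derivative matrix Df(x): row i is (x - a_i)^T / ||x - a_i||."""
    diff = x - anchors
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

# Hypothetical data: four anchors in the plane and noise-free ranges to (1, 1).
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
rho = np.linalg.norm(np.array([1.0, 1.0]) - anchors, axis=1)

x = np.array([2.0, 3.0])
objective = np.sum(range_residual(x, anchors, rho) ** 2)  # sum of squared range residuals
```

The derivative matrix is undefined when x coincides with one of the anchors, so a practical implementation would need to guard against that case.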


18.2 Gauss-Newton algorithm


In this section we describe a powerful heuristic algorithm for the nonlinear least
squares problem (18.2) that bears the names of the two famous mathematicians
Carl Friedrich Gauss and Isaac Newton. In the next section we will also describe a
variation of the Gauss-Newton algorithm known as the Levenberg-Marquardt
algorithm, which addresses some shortcomings of the basic Gauss-Newton algorithm.


The Gauss-Newton and Levenberg-Marquardt algorithms are iterative algorithms
that generate a sequence of points $x^{(1)}, x^{(2)}, \ldots$. The vector $x^{(1)}$ is called
the starting point of the algorithm, and $x^{(k)}$ is called the kth iterate. Moving from
$x^{(k)}$ to $x^{(k+1)}$ is called an iteration of the algorithm. We judge the iterates by
the norm of the associated residuals, $\|f(x^{(k)})\|$, or its square. The algorithm is
terminated when $\|f(x^{(k)})\|$ is small enough, or $x^{(k+1)}$ is very near $x^{(k)}$, or when a
maximum number of iterations is reached.


18.2.1 Basic Gauss-Newton algorithm


The idea behind the Gauss—Newton algorithm is simple: We alternate between 
finding an affine approximation of the function f at the current iterate, and then 
solving the associated linear least squares problem to find the next iterate. This 
combines two of the most powerful ideas in applied mathematics: Calculus is used 
to form an affine approximation of a function near a given point, and least squares 
is used to compute an approximate solution of the resulting affine equations. 

We now describe the algorithm in more detail. At each iteration k, we form
the affine approximation $\hat f$ of f at the current iterate $x^{(k)}$, given by the Taylor
approximation

$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}), \qquad (18.5)$$

where the m × n matrix $Df(x^{(k)})$ is the Jacobian or derivative matrix of f (see §8.2.1
and §C.1). The affine function $\hat f(x; x^{(k)})$ is a very good approximation of f(x)
provided x is near $x^{(k)}$, i.e., $\|x - x^{(k)}\|$ is small.

The next iterate $x^{(k+1)}$ is then taken to be the minimizer of $\|\hat f(x; x^{(k)})\|^2$, the
norm squared of the affine approximation of f at $x^{(k)}$. Assuming that the derivative
matrix $Df(x^{(k)})$ has linearly independent columns (which requires m ≥ n), we have

$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)}). \qquad (18.6)$$

This iteration gives the basic Gauss-Newton algorithm.


Algorithm 18.1 BASIC GAUSS-NEWTON ALGORITHM FOR NONLINEAR LEAST SQUARES
given a differentiable function $f : \mathbf{R}^n \to \mathbf{R}^m$, an initial point $x^{(1)}$.
For $k = 1, 2, \ldots, k^{\max}$

1. Form affine approximation at current iterate using calculus. Evaluate the
Jacobian $Df(x^{(k)})$ and define

$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}).$$

2. Update iterate using linear least squares. Set $x^{(k+1)}$ as the minimizer of
$\|\hat f(x; x^{(k)})\|^2$,

$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)}).$$


The Gauss-Newton algorithm is terminated early if $f(x^{(k)})$ is very small, or
$x^{(k+1)} \approx x^{(k)}$. It terminates with an error if the columns of $Df(x^{(k)})$ are linearly dependent.

The condition $x^{(k+1)} = x^{(k)}$ (the exact form of our stopping condition) holds
when

$$\left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)}) = 0,$$

which occurs if and only if $Df(x^{(k)})^T f(x^{(k)}) = 0$ (since we assume that $Df(x^{(k)})$
has linearly independent columns). So the Gauss-Newton algorithm stops only
when the optimality condition (18.3) holds.
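The iteration (18.6) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `gauss_newton`, the tolerance, and the small test problem (three equations in two unknowns with a zero-residual solution at (1, 2)) are all hypothetical. The step is computed with a linear least squares solve rather than by forming the inverse in (18.6) explicitly.

```python
import numpy as np

def gauss_newton(f, Df, x1, kmax=20, tol=1e-10):
    """Basic Gauss-Newton iteration; returns the list of iterates."""
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for _ in range(kmax):
        r, J = f(x), Df(x)
        if np.linalg.matrix_rank(J) < J.shape[1]:
            raise np.linalg.LinAlgError("columns of Df(x) are linearly dependent")
        # The step d minimizes ||J d + r||^2, which gives the update (18.6).
        d, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + d
        iterates.append(x)
        # Terminate early if the residual is very small or the iterate barely moved.
        if np.linalg.norm(f(x)) < tol or np.linalg.norm(d) < tol:
            break
    return iterates

# Hypothetical test problem (m = 3, n = 2) with a zero-residual solution at (1, 2).
f = lambda x: np.array([x[0] - 1.0, x[1] - 2.0, x[0] * x[1] - 2.0])
Df = lambda x: np.array([[1.0, 0.0], [0.0, 1.0], [x[1], x[0]]])
iterates = gauss_newton(f, Df, [1.2, 1.8])
```

Started this close to the solution, the iterates approach (1, 2) within a few steps; nothing in the sketch protects against divergence from a poor starting point.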
We can also observe that

$$\|\hat f(x^{(k+1)}; x^{(k)})\|^2 \le \|\hat f(x^{(k)}; x^{(k)})\|^2 = \|f(x^{(k)})\|^2 \qquad (18.7)$$

holds, since $x^{(k+1)}$ minimizes $\|\hat f(x; x^{(k)})\|^2$, and $\hat f(x^{(k)}; x^{(k)}) = f(x^{(k)})$. The norm
of the residual of the approximation goes down in each iteration. This is not the
same as

$$\|f(x^{(k+1)})\|^2 \le \|f(x^{(k)})\|^2, \qquad (18.8)$$

i.e., the norm of the residual goes down in each iteration, which is what we would
like.


Shortcomings of the basic Gauss-Newton algorithm. We will see in examples
that the Gauss-Newton algorithm can work well, in the sense that the iterates
$x^{(k)}$ converge very quickly to a point with small residual. But the Gauss-Newton
algorithm has two related serious shortcomings.

The first is that it can fail, by producing a sequence of points with the norm of
the residual $\|f(x^{(k)})\|$ increasing to large values, as opposed to decreasing to a small
value, which is what we want. (In this case the algorithm is said to diverge.) The
mechanism behind this failure is related to the difference between (18.7) and (18.8).
The approximation

$$\|f(x)\|^2 \approx \|\hat f(x; x^{(k)})\|^2$$

is guaranteed to hold only when x is near $x^{(k)}$. So when $x^{(k+1)}$ is not near $x^{(k)}$,
$\|f(x^{(k+1)})\|^2$ and $\|\hat f(x^{(k+1)}; x^{(k)})\|^2$ can be very different. In particular, the (true)
residual at $x^{(k+1)}$ can be larger than the residual at $x^{(k)}$.

The second serious shortcoming of the basic Gauss-Newton algorithm is the
assumption that the columns of the derivative matrix $Df(x^{(k)})$ are linearly
independent. In some applications, this assumption never holds; in others, it can fail to
hold at some iterate $x^{(k)}$, in which case the Gauss-Newton algorithm stops, since
$x^{(k+1)}$ is not defined.

We will see that a simple modification of the Gauss-Newton algorithm,
described below in §18.3, addresses both of these shortcomings.


18.2.2 Newton algorithm


For the special case m = n, the Gauss-Newton algorithm reduces to another famous 
algorithm for solving a set of n nonlinear equations in n variables, called the Newton 
algorithm. (The algorithm is sometimes called the Newton-Raphson algorithm, 
since Newton developed the method only for the special case n = 1, and Joseph 
Raphson later extended it to the case n > 1.) 

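When m = n, the matrix $Df(x^{(k)})$ is square, so the basic Gauss-Newton
update (18.6) can be simplified to

$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)}) \right)^{-1} \left( Df(x^{(k)})^T \right)^{-1} Df(x^{(k)})^T f(x^{(k)})$$
$$\phantom{x^{(k+1)}} = x^{(k)} - \left( Df(x^{(k)}) \right)^{-1} f(x^{(k)}).$$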


This iteration gives the Newton algorithm. 
Figure 18.2 One iteration of the Newton algorithm for solving an equation
f(x) = 0 in one variable.


Algorithm 18.2 NEWTON ALGORITHM FOR SOLVING NONLINEAR EQUATIONS
given a differentiable function $f : \mathbf{R}^n \to \mathbf{R}^n$, an initial point $x^{(1)}$.
For $k = 1, 2, \ldots, k^{\max}$

1. Form affine approximation at current iterate. Evaluate the Jacobian $Df(x^{(k)})$
and define

$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}).$$

2. Update iterate by solving linear equations. Set $x^{(k+1)}$ as the solution of
$\hat f(x; x^{(k)}) = 0$,

$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)}) \right)^{-1} f(x^{(k)}).$$


The basic Newton algorithm shares the same shortcomings as the basic
Gauss-Newton algorithm, i.e., it can diverge, and the iterations terminate if the derivative
matrix is not invertible.


Newton algorithm for n = 1. The Newton algorithm is easily understood for
n = 1. The iteration is

$$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})} \qquad (18.9)$$

and is illustrated in figure 18.2. To update $x^{(k)}$ we form the Taylor approximation

$$\hat f(x; x^{(k)}) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)})$$

and set it to zero to find the next iterate $x^{(k+1)}$. If $f'(x^{(k)}) \ne 0$, the solution of
$\hat f(x; x^{(k)}) = 0$ is given by the right-hand side of (18.9). If $f'(x^{(k)}) = 0$, the Newton
algorithm terminates with an error.
Figure 18.3 The first iterations in the Newton algorithm for solving f(x) = 0,
for two starting points: $x^{(1)} = 0.95$ and $x^{(1)} = 1.15$.


Figure 18.4 Value of $f(x^{(k)})$ versus iteration number k for Newton’s method
in the example of figure 18.3, started at $x^{(1)} = 0.95$ and $x^{(1)} = 1.15$.


Example. The function

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (18.10)$$

has a unique zero at the origin, i.e., the only solution of f(x) = 0 is x = 0. (This
function is called the sigmoid function, and will make another appearance later.)
The Newton iteration started at $x^{(1)} = 0.95$ converges quickly to the solution x = 0.
With $x^{(1)} = 1.15$, however, the iterates diverge. This is shown in figures 18.3
and 18.4.
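The behavior described in this example can be reproduced with a short Python sketch of the scalar iteration (18.9). The function name `newton_scalar` and the iteration limits are assumptions chosen for illustration; the sigmoid (18.10) is evaluated with `math.tanh`, which is the same function.

```python
import math

def newton_scalar(f, fprime, x1, kmax=10):
    """Newton iteration (18.9) for one equation in one variable; returns the iterates."""
    xs = [x1]
    for _ in range(kmax):
        slope = fprime(xs[-1])
        if slope == 0.0:
            break  # the update is undefined here, so stop with the iterates so far
        xs.append(xs[-1] - f(xs[-1]) / slope)
    return xs

f = math.tanh                                  # equals (e^x - e^-x) / (e^x + e^-x)
fprime = lambda x: 1.0 - math.tanh(x) ** 2     # derivative of tanh

good = newton_scalar(f, fprime, 0.95)          # settles at the solution x = 0
bad = newton_scalar(f, fprime, 1.15, kmax=3)   # only a few steps: the iterates grow quickly
```

The diverging run is cut off after three steps because the iterates soon become so large that the computed derivative rounds to zero.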


18.3 Levenberg-Marquardt algorithm


In this section we describe a variation on the basic Gauss-Newton algorithm (as
well as the Newton algorithm) that addresses the shortcomings described above.
The variation comes directly from ideas we have encountered earlier in this book. It
was first proposed by Kenneth Levenberg and Donald Marquardt, and is called the
Levenberg-Marquardt algorithm. It is also sometimes called the Gauss-Newton
algorithm, since it is a natural extension of the basic Gauss-Newton algorithm
described above.


Multi-objective update formulation. The main problem with the Gauss-Newton
algorithm is that the minimizer of the approximation $\|\hat f(x; x^{(k)})\|^2$ may be far from
the current iterate $x^{(k)}$, in which case the approximation $\hat f(x; x^{(k)}) \approx f(x)$ need
not hold, which implies that $\|\hat f(x; x^{(k)})\|^2 \approx \|f(x)\|^2$ need not hold. In choosing
$x^{(k+1)}$, then, we have two objectives: We would like $\|\hat f(x; x^{(k)})\|^2$ small, and we
would also like $\|x - x^{(k)}\|^2$ small. The first objective is an approximation of what
we really want to minimize; the second objective expresses the idea that we should
not move so far that we cannot trust the affine approximation. This suggests that
we should choose $x^{(k+1)}$ as the minimizer of

$$\|\hat f(x; x^{(k)})\|^2 + \lambda^{(k)} \|x - x^{(k)}\|^2, \qquad (18.11)$$

where $\lambda^{(k)}$ is a positive parameter. We add an iteration superscript to the
parameter $\lambda$ since it can take different values in different iterations. For $\lambda^{(k)}$ small,
we primarily minimize the first term, the squared norm of the approximation; for
$\lambda^{(k)}$ large, we choose $x^{(k+1)}$ near $x^{(k)}$. (For $\lambda^{(k)} = 0$, this coincides with the next
iterate in the basic Gauss-Newton algorithm.) The second term in (18.11) is
sometimes called a trust penalty term, since it penalizes choices of x that are far from
$x^{(k)}$, where we cannot trust the affine approximation. The parameter $\lambda^{(k)}$ is
sometimes called the trust parameter (although ‘distrust parameter’ is perhaps more
accurate).

Computing the minimizer of (18.11) is a multi-objective least squares or
regularized least squares problem, and equivalent to minimizing

$$\left\| \begin{bmatrix} Df(x^{(k)}) \\ \sqrt{\lambda^{(k)}}\, I \end{bmatrix} x - \begin{bmatrix} Df(x^{(k)}) x^{(k)} - f(x^{(k)}) \\ \sqrt{\lambda^{(k)}}\, x^{(k)} \end{bmatrix} \right\|^2.$$

Since $\lambda^{(k)}$ is positive, the stacked matrix in this least squares problem has linearly
independent columns, even when $Df(x^{(k)})$ does not. It follows that the solution of
the least squares problem exists and is unique. From the normal equations of the
least squares problem we can derive a useful expression for $x^{(k+1)}$:

$$\left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right) x^{(k+1)}$$
$$\quad = Df(x^{(k)})^T \left( Df(x^{(k)}) x^{(k)} - f(x^{(k)}) \right) + \lambda^{(k)} x^{(k)}$$
$$\quad = \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right) x^{(k)} - Df(x^{(k)})^T f(x^{(k)}),$$

and therefore

$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right)^{-1} Df(x^{(k)})^T f(x^{(k)}). \qquad (18.12)$$

The matrix inverse here always exists.

From (18.12), we see that $x^{(k+1)} = x^{(k)}$ only if $2 Df(x^{(k)})^T f(x^{(k)}) = 0$, i.e.,
only when the optimality condition (18.3) holds for $x^{(k)}$. So like the Gauss-Newton
algorithm, the Levenberg-Marquardt algorithm stops (or more accurately, repeats
itself with $x^{(k+1)} = x^{(k)}$) only when the optimality condition (18.3) holds.


Updating the trust parameter. The final issue is how to choose the trust
parameter $\lambda^{(k)}$. When $\lambda^{(k)}$ is too small, $x^{(k+1)}$ can be far enough away from $x^{(k)}$
that $\|f(x^{(k+1)})\|^2 > \|f(x^{(k)})\|^2$ can hold, i.e., our true objective function increases,
which is not what we want. When $\lambda^{(k)}$ is too large, $x^{(k+1)} - x^{(k)}$ is small, so the
affine approximation is good, and the objective decreases (which is good). But in
this case we have $x^{(k+1)}$ very near $x^{(k)}$, so the decrease in objective is small, and
it will take many iterations to make progress. We want $\lambda^{(k)}$ in between these two
cases, big enough that the approximation holds well enough to get a decrease in
objective, but not much bigger, which slows convergence.

Several algorithms can be used to adjust $\lambda$. One simple method forms $x^{(k+1)}$
using the current value of $\lambda$ and checks if the objective has decreased. If it has, we
accept the new point and decrease $\lambda$ a bit for the next iteration. If the objective
has not decreased, which means $\lambda$ is too small, we do not update the point $x^{(k+1)}$,
and increase the trust parameter $\lambda$ substantially.


Levenberg-Marquardt algorithm. The ideas above can be formalized as the
algorithm given below.


Algorithm 18.3 LEVENBERG-MARQUARDT ALGORITHM FOR NONLINEAR LEAST SQUARES
given a differentiable function $f : \mathbf{R}^n \to \mathbf{R}^m$, an initial point $x^{(1)}$, an initial trust
parameter $\lambda^{(1)} > 0$.
For $k = 1, 2, \ldots, k^{\max}$

1. Form affine approximation at current iterate. Evaluate the Jacobian $Df(x^{(k)})$
and define

$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}).$$

2. Compute tentative iterate. Set $x^{(k+1)}$ as minimizer of

$$\|\hat f(x; x^{(k)})\|^2 + \lambda^{(k)} \|x - x^{(k)}\|^2.$$

3. Check tentative iterate.
If $\|f(x^{(k+1)})\|^2 < \|f(x^{(k)})\|^2$, accept iterate and reduce $\lambda$: $\lambda^{(k+1)} = 0.8 \lambda^{(k)}$.
Otherwise, increase $\lambda$ and do not update x: $\lambda^{(k+1)} = 2 \lambda^{(k)}$ and $x^{(k+1)} = x^{(k)}$.
Stopping criteria. The algorithm is stopped before the maximum number of
iterations $k^{\max}$ if either of the following two conditions holds.


• Small residual: $\|f(x^{(k+1)})\|^2$ is small enough. This means we have (almost)
solved the equations f(x) = 0, and therefore (almost) minimized $\|f(x)\|^2$.


• Small optimality condition residual: $\|2 Df(x^{(k+1)})^T f(x^{(k+1)})\|$ is small enough, i.e.,
the optimality condition (18.3) almost holds.


When the algorithm terminates with small optimality condition residual, we
can say very little for sure about the point $x^{(k+1)}$ computed. The point found may
be a minimizer of $\|f(x)\|^2$, or perhaps not. Since the algorithm does not always
find a minimizer of $\|f(x)\|^2$, it is a heuristic. Like the k-means algorithm, which
is also a heuristic, the Levenberg-Marquardt algorithm is widely used in many
applications, even when we cannot be sure that it has found a point that gives the
smallest possible residual norm.
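Algorithm 18.3 and the two stopping criteria can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function name `levenberg_marquardt`, the tolerance, and the small range-measurement instance (five made-up anchor points with noise-free ranges to the point (1, 1)) are hypothetical and not the data used in the text. The tentative step is found by solving the stacked least squares problem, written in terms of the step $d = x - x^{(k)}$.

```python
import numpy as np

def levenberg_marquardt(f, Df, x1, lam1, kmax=100, tol=1e-8):
    """Sketch of algorithm 18.3; returns the final point and final trust parameter."""
    x, lam = np.asarray(x1, dtype=float), float(lam1)
    n = x.size
    for _ in range(kmax):
        r, J = f(x), Df(x)
        # Stopping criteria: small residual, or small optimality condition residual.
        if r @ r < tol**2 or np.linalg.norm(2 * J.T @ r) < tol:
            break
        # Tentative step d minimizes ||J d + r||^2 + lam ||d||^2 (stacked least squares).
        A = np.vstack([J, np.sqrt(lam) * np.eye(n)])
        b = np.concatenate([-r, np.zeros(n)])
        d, *_ = np.linalg.lstsq(A, b, rcond=None)
        r_new = f(x + d)
        if r_new @ r_new < r @ r:
            x, lam = x + d, 0.8 * lam   # accept the tentative iterate, reduce lambda
        else:
            lam = 2.0 * lam             # reject it, keep x, increase lambda
    return x, lam

# Hypothetical range-measurement instance: five anchors, noise-free ranges to (1, 1).
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 5.0]])
rho = np.linalg.norm(np.array([1.0, 1.0]) - anchors, axis=1)
f = lambda x: np.linalg.norm(x - anchors, axis=1) - rho
Df = lambda x: (x - anchors) / np.linalg.norm(x - anchors, axis=1, keepdims=True)

x_hat, lam_final = levenberg_marquardt(f, Df, [1.5, 1.5], lam1=1.0)
```

Because the ranges here are noise-free and the starting point is close to (1, 1), the sketch recovers that point; with noisy data or a different starting point it may stop at a different point, as discussed above.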


Warm start. In many applications a sequence of similar or related nonlinear least 
squares problems are solved. In these cases it is common to start the Levenberg— 
Marquardt algorithm at the solution of the previously solved problem. If the prob- 
lem to be solved is not much different from the previous problem, this can greatly 
reduce the number of iterations required to converge. This technique is called warm 
starting. It is commonly used in nonlinear model fitting, when multiple models are 
fit as we vary a regularization parameter. 


Multiple runs. It is common to run the Levenberg-Marquardt algorithm from
several different starting points $x^{(1)}$. If the final points found by running the
algorithm from these different starting points are the same, or very close, it increases
our confidence that we have found a solution of the nonlinear least squares
problem, but we cannot be sure. If the different runs of the algorithm produce different
points, we use the best one found, i.e., the one with the smallest value of $\|f(x)\|^2$.


Complexity. Each execution of step 1 requires evaluating the derivative matrix
of f. The complexity of this step depends on the particular function f. Each
execution of step 2 requires the solution of a regularized least squares problem.
Using the QR factorization of the stacked matrix this requires $2(m + n)n^2$ flops
(see §15.5). When m is on the order of n, or larger, this is the same order as $mn^2$.
When m is much smaller than n, $x^{(k+1)}$ can be computed using the kernel trick
described in §15.5, which requires $2nm^2$ flops.


Levenberg-Marquardt update for n = 1. The Newton update for solving f(x) = 0
when n = 1 is given in (18.9). The Levenberg-Marquardt update for minimizing
$f(x)^2$ is

$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{\lambda^{(k)} + (f'(x^{(k)}))^2}\, f(x^{(k)}). \qquad (18.13)$$

For $\lambda^{(k)} = 0$ they agree; but when $f'(x^{(k)}) = 0$, for example, the
Levenberg-Marquardt update makes sense (since $\lambda^{(k)} > 0$), whereas the Newton update is
undefined.
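The scalar update (18.13), combined with the trust parameter rule of algorithm 18.3, can be sketched in plain Python. The function name `lm_scalar` and the stopping tolerance are assumptions for illustration; the test function is the sigmoid (18.10), evaluated as `math.tanh`, started at 1.15 with initial trust parameter 1.

```python
import math

def lm_scalar(f, fprime, x1, lam1, kmax=50, tol=1e-10):
    """Levenberg-Marquardt for minimizing f(x)^2 with scalar x, using update (18.13)."""
    x, lam = x1, lam1
    history = [(x, lam)]
    for _ in range(kmax):
        if abs(f(x)) < tol:
            break
        slope = fprime(x)
        x_try = x - slope / (lam + slope**2) * f(x)
        if f(x_try) ** 2 < f(x) ** 2:
            x, lam = x_try, 0.8 * lam   # accept the tentative point, reduce lambda
        else:
            lam = 2.0 * lam             # reject it, keep x, increase lambda
        history.append((x, lam))
    return history

history = lm_scalar(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2, x1=1.15, lam1=1.0)
x_final, lam_final = history[-1]
```

From the starting point where the plain Newton iteration diverges, this damped update moves steadily toward x = 0.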
Figure 18.5 Values of $f(x^{(k)})$ and $\lambda^{(k)}$ versus the iteration number k
for the Levenberg-Marquardt algorithm applied to f(x) = (exp(x) −
exp(−x))/(exp(x) + exp(−x)). The starting point is $x^{(1)} = 1.15$ and $\lambda^{(1)} = 1$.


Examples 


Nonlinear equation. The first example is the sigmoid function (18.10) from the
example on page 390. We saw in figures 18.3 and 18.4 that the Gauss-Newton
method, which reduces to Newton’s method in this case, diverges when the
initial value $x^{(1)}$ is 1.15. The Levenberg-Marquardt algorithm, however, solves this
problem. Figure 18.5 shows the value of the residual $f(x^{(k)})$, and the value of $\lambda^{(k)}$,
for the Levenberg-Marquardt algorithm started from $x^{(1)} = 1.15$ and $\lambda^{(1)} = 1$. It
converges to the solution x = 0 in around 10 iterations.


Equilibrium prices. We illustrate algorithm 18.3 with a small instance of the
equilibrium price problem, with supply and demand functions

$$D(p) = \exp\left( E^{\mathrm{d}} (\log p - \log p^{\mathrm{nom}}) + d^{\mathrm{nom}} \right),$$
$$S(p) = \exp\left( E^{\mathrm{s}} (\log p - \log p^{\mathrm{nom}}) + s^{\mathrm{nom}} \right),$$

where $E^{\mathrm{d}}$ and $E^{\mathrm{s}}$ are the demand and supply elasticity matrices, $d^{\mathrm{nom}}$ and $s^{\mathrm{nom}}$
are the nominal demand and supply vectors, and the log and exp appearing in
the equations apply to vectors elementwise. Figure 18.6 shows the contour lines of
$\|f(p)\|^2$, where f(p) = S(p) − D(p) is the excess supply, for

$$p^{\mathrm{nom}} = (2.8, 10), \quad d^{\mathrm{nom}} = (3.1, 2.2), \quad s^{\mathrm{nom}} = (2.2, 0.3)$$

and

$$E^{\mathrm{d}} = \begin{bmatrix} -0.5 & 0.2 \\ 0 & -0.5 \end{bmatrix}, \qquad E^{\mathrm{s}} = \begin{bmatrix} 0.5 & -0.3 \\ -0.15 & 0.8 \end{bmatrix}.$$

Figure 18.7 shows the iterates of algorithm 18.3, started at p = (3, 9) and
$\lambda^{(1)} = 1$. The values of $\|f(p^{(k)})\|^2$ and the trust parameter $\lambda^{(k)}$ versus iteration k
are shown in figure 18.8.


Figure 18.6 Contour lines of the square norm of the excess supply f(p) =
S(p) − D(p) for a small example with two commodities. The point marked
with a star gives the equilibrium prices, for which f(p) = 0.


Figure 18.7 Iterates of the Levenberg-Marquardt algorithm started at p =
(3, 9).


Figure 18.8 Cost function $\|f(p^{(k)})\|^2$ and trust parameter $\lambda^{(k)}$ versus iteration
number k in the example of figure 18.7.


Location from range measurements. We illustrate algorithm 18.3 with a small
instance of the location from range measurements problem, with five points $a_i$ in a
plane, shown in figure 18.9. The range measurements $\rho_i$ are the distances of these
points to the ‘true’ point (1, 1), plus some measurement errors. Figure 18.9 also
shows the level curves of $\|f(x)\|^2$, and the point (1.18, 0.82) (marked with a star)
that minimizes $\|f(x)\|^2$. (This point is close to, but not equal to, the ‘true’ value
(1, 1), due to the noise added to the range measurements.) Figure 18.10 shows the
graph of $\|f(x)\|$.
We run algorithm 18.3 from three different starting points,

$$x^{(1)} = (1.8, 3.5), \qquad x^{(1)} = (2.2, 3.5), \qquad x^{(1)} = (3.0, 1.5),$$

with $\lambda^{(1)} = 0.1$. Figure 18.11 shows the iterates $x^{(k)}$ for the three starting points.
When started at (1.8, 3.5) (blue circles) or (3.0, 1.5) (brown diamonds) the
algorithm converges to (1.18, 0.82), the point that minimizes $\|f(x)\|^2$. When the
algorithm is started at (2.2, 3.5) the algorithm converges to a non-optimal point
(2.98, 2.12) (which gives a poor estimate of the ‘true’ location (1, 1)).

The values of $\|f(x^{(k)})\|^2$ and the trust parameter $\lambda^{(k)}$ during the iteration are
shown in figure 18.12. As can be seen from this figure, in the first run of the
algorithm (blue circles), $\lambda^{(k)}$ is increased in the third iteration. Correspondingly,
$x^{(3)} = x^{(4)}$ in figure 18.12. For the second starting point (red squares) $\lambda^{(k)}$ decreases
monotonically. For the third starting point (brown diamonds) $\lambda^{(k)}$ increases in
iterations 2 and 4.
Figure 18.9 Contour lines of $\|f(x)\|^2$ where $f_i(x) = \|x - a_i\| - \rho_i$. The
dots show the points $a_i$, and the point marked with a star is the point that
minimizes $\|f(x)\|^2$.


Figure 18.10 Graph of $\|f(x)\|$ in the location from range measurements
example.


Figure 18.11 Iterates of the Levenberg-Marquardt algorithm started at three
different starting points.


Figure 18.12 Cost function $\|f(x^{(k)})\|^2$ and trust parameter $\lambda^{(k)}$ versus
iteration number k for the three starting points.


Figure 18.13 Least squares fit of a function $\hat f(x; \theta) = \theta_1 e^{\theta_2 x} \cos(\theta_3 x + \theta_4)$ to
N = 60 points $(x^{(i)}, y^{(i)})$.


Nonlinear model fitting 


The Levenberg-Marquardt algorithm is widely used for nonlinear model fitting.
As in §13.1 we are given a set of data, $x^{(1)}, \ldots, x^{(N)}$, $y^{(1)}, \ldots, y^{(N)}$, where the
n-vectors $x^{(1)}, \ldots, x^{(N)}$ are the feature vectors, and the scalars $y^{(1)}, \ldots, y^{(N)}$ are the
associated outcomes. (So here the superscript indexes the given data; previously
in this chapter the superscript denoted the iteration number.)

In nonlinear model fitting, we fit a model of the general form $y \approx \hat f(x; \theta)$ to the
given data, where the p-vector $\theta$ contains the model parameters. In linear model
fitting, $\hat f(x; \theta)$ is a linear function of the parameters, so it has the special form

$$\hat f(x; \theta) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x),$$

where $f_1, \ldots, f_p$ are scalar-valued functions, called the basis functions (see §13.1).
In nonlinear model fitting the dependence of $\hat f(x; \theta)$ on $\theta$ is not linear (or affine),
so it does not have the simple form of a linear combination of p basis functions.
As in linear model fitting, we choose the parameter $\theta$ by (approximately)
minimizing the sum of the squares of the prediction residuals,

$$\sum_{i=1}^{N} \left( \hat f(x^{(i)}; \theta) - y^{(i)} \right)^2,$$

which is a nonlinear least squares problem, with variable $\theta$. (We can also add a
regularization term to this objective.)


Example. Figure 18.13 shows a nonlinear model fitting example. The model is
an exponentially decaying sinusoid

$$\hat f(x; \theta) = \theta_1 e^{\theta_2 x} \cos(\theta_3 x + \theta_4),$$

with four parameters $\theta_1, \theta_2, \theta_3, \theta_4$. (This model is an affine function of $\theta_1$, but it is
not an affine function of $\theta_2$, $\theta_3$, or $\theta_4$.) We fit the model to N = 60 points $(x^{(i)}, y^{(i)})$
by minimizing the sum of the squared residuals (18.4) over the four parameters.


Figure 18.14 The solid line minimizes the sum of the squares of the
orthogonal distances of points to the graph of the polynomial.
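To apply the Levenberg-Marquardt algorithm to the decaying sinusoid model, we need the derivative matrix of the residual vector with respect to the parameters, which equals the matrix of partial derivatives of the model values. The Python sketch below (using NumPy) writes it out. The function names `model` and `model_jacobian`, the sample points, and the parameter values are assumptions chosen for illustration; they are not the data behind figure 18.13.

```python
import numpy as np

def model(x, theta):
    """Decaying sinusoid theta1 * exp(theta2 x) * cos(theta3 x + theta4) at points x."""
    t1, t2, t3, t4 = theta
    return t1 * np.exp(t2 * x) * np.cos(t3 * x + t4)

def model_jacobian(x, theta):
    """N x 4 matrix of partial derivatives of the model values with respect to theta."""
    t1, t2, t3, t4 = theta
    e, c, s = np.exp(t2 * x), np.cos(t3 * x + t4), np.sin(t3 * x + t4)
    return np.column_stack([e * c, t1 * x * e * c, -t1 * x * e * s, -t1 * e * s])

# Hypothetical sample points and parameter values, chosen only for illustration.
x = np.linspace(0.0, 5.0, 60)
theta = np.array([1.0, -0.2, 2.0, 0.5])
J = model_jacobian(x, theta)
```

The first column does not depend on $\theta_1$, which reflects the fact that the model is affine in $\theta_1$; the other three columns depend on all four parameters.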


Orthogonal distance regression. Consider the linear in the parameters model

$$\hat f(x; \theta) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x),$$

with basis functions $f_i : \mathbf{R}^n \to \mathbf{R}$, and a data set of N pairs $(x^{(i)}, y^{(i)})$. The usual
objective is the sum of squares of the difference between the model prediction
$\hat f(x^{(i)}; \theta)$ and the observed value $y^{(i)}$, which leads to a linear least squares problem.
In orthogonal distance regression we use another objective, the sum of the squared
distances of N points $(x^{(i)}, y^{(i)})$ to the graph of $\hat f$, i.e., the set of points of the form
$(u, \hat f(u))$. This model can be found by solving the nonlinear least squares problem

$$\mbox{minimize} \quad \sum_{i=1}^{N} \left( \hat f(u^{(i)}; \theta) - y^{(i)} \right)^2 + \sum_{i=1}^{N} \|u^{(i)} - x^{(i)}\|^2$$

with variables $\theta_1, \ldots, \theta_p$, and $u^{(1)}, \ldots, u^{(N)}$. In orthogonal distance regression, we
are allowed to choose the parameters in the model, and also to modify the feature
vectors from $x^{(i)}$ to $u^{(i)}$ to obtain a better fit. (Orthogonal distance regression is
an example of an error-in-variables model, since it takes into account errors in the
regressors or independent variables.) Figure 18.14 shows a cubic polynomial fit to
25 points using this method. The open circles are the points $(x^{(i)}, y^{(i)})$. The small
circles on the graph of the polynomial are the points $(u^{(i)}, \hat f(u^{(i)}; \theta))$. Roughly
speaking, we fit a curve that passes near all the data points, as measured by the
(minimum) distance from the data points to the curve. In contrast, ordinary least
squares regression finds a curve that minimizes the sum of squares of the vertical
errors between the curve and the data points.
Nonlinear least squares classification


In this section we describe a nonlinear extension of the least squares classification
method discussed in chapters 14 and 15, that typically out-performs the basic least
squares classifier in practice.

The Boolean classifier of chapter 14 fits a linearly parametrized function

$$\tilde f(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x)$$

to the data points $(x^{(i)}, y^{(i)})$, i = 1, ..., N, where $y^{(i)} \in \{-1, 1\}$, using linear
least squares. The parameters $\theta_1, \ldots, \theta_p$ are chosen to minimize the sum squares
objective

$$\sum_{i=1}^{N} \left( \tilde f(x^{(i)}) - y^{(i)} \right)^2, \qquad (18.14)$$

plus, optionally, a regularization term. This (hopefully) results in $\tilde f(x^{(i)}) \approx y^{(i)}$,
which is roughly what we want. We can think of $\tilde f(x)$ as the continuous prediction
of the Boolean outcome y. The classifier itself is given by $\hat f(x) = \mathrm{sign}(\tilde f(x))$; this
is the Boolean prediction of the outcome.

Instead of the sum square prediction error for the continuous prediction,
consider the sum square prediction error for the Boolean prediction,

$$\sum_{i=1}^{N} \left( \hat f(x^{(i)}) - y^{(i)} \right)^2 = \sum_{i=1}^{N} \left( \mathrm{sign}(\tilde f(x^{(i)})) - y^{(i)} \right)^2. \qquad (18.15)$$

This is 4 times the number of classification errors we make on the training set. To
see this, we note that when $\hat f(x^{(i)}) = y^{(i)}$, which means that a correct prediction
was made on the ith data point, we have $(\hat f(x^{(i)}) - y^{(i)})^2 = 0$. When $\hat f(x^{(i)}) \ne y^{(i)}$,
which means that an incorrect prediction was made on the ith data point, one of
the values is +1 and the other is −1, so we have $(\hat f(x^{(i)}) - y^{(i)})^2 = 4$.
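The factor of 4 is easy to check numerically. The short Python sketch below uses six made-up continuous predictions and outcomes (hypothetical values, not from the MNIST example), and assumes the convention that sign(u) = +1 for u ≥ 0 and −1 otherwise.

```python
def sign(u):
    """Sign function with the assumed convention sign(0) = +1."""
    return 1.0 if u >= 0 else -1.0

# Hypothetical continuous predictions and Boolean outcomes for six examples.
f_tilde = [0.9, -0.2, 1.4, -1.1, 0.3, -0.6]
y = [1, 1, 1, -1, -1, -1]

# Number of classification errors and the sum square Boolean prediction error (18.15).
num_errors = sum(sign(u) != yi for u, yi in zip(f_tilde, y))
objective = sum((sign(u) - yi) ** 2 for u, yi in zip(f_tilde, y))
```

Here two of the six predictions have the wrong sign, and the objective evaluates to 8, i.e., 4 times the number of errors.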

The objective (18.15) is what we really want; the least squares objective (18.14)
is a surrogate for what we want. But we cannot use the Levenberg-Marquardt
algorithm to minimize the objective (18.15), since the sign function is not differentiable.
To get around this, we replace the sign function with a differentiable approximation,
for example the sigmoid function

$$\phi(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}, \qquad (18.16)$$

shown in figure 18.15. We choose $\theta$ by solving the nonlinear least squares problem
of minimizing

$$\sum_{i=1}^{N} \left( \phi(\tilde f(x^{(i)})) - y^{(i)} \right)^2, \qquad (18.17)$$

using the Levenberg-Marquardt algorithm. (We can also add regularization to
this objective.) Minimizing the nonlinear least squares objective (18.17) is a good
approximation for choosing the parameter vector $\theta$ so as to minimize the number
of classification errors made on the training set.


Figure 18.15 The sigmoid function $\phi$.


Loss function interpretation. We can interpret the objective functions (18.14),
(18.15), and (18.17) in terms of loss functions that depend on the continuous
prediction $\tilde f(x^{(i)})$ and the outcome $y^{(i)}$. Each of the three objectives has the form

$$\sum_{i=1}^{N} \ell(\tilde f(x^{(i)}), y^{(i)}),$$

where $\ell$ is a loss function. The first argument of the loss function is a real number,
and the second argument is Boolean, with values −1 or +1. For the linear least
squares objective (18.14), the loss function is $\ell(u, y) = (u - y)^2$. For the nonlinear
least squares objective with the sign function (18.15), the loss function is $\ell(u, y) =
(\mathrm{sign}(u) - y)^2$. For the differentiable nonlinear least squares objective (18.17), the
loss function is $\ell(u, y) = (\phi(u) - y)^2$. Roughly speaking, the loss function $\ell(u, y)$
tells us how bad it is to have $\tilde f(x^{(i)}) = u$ when $y = y^{(i)}$.


Since the outcome y takes on only two values, —1 and +1, we can plot the loss 
functions as functions of u for these two values of y. Figure 18.16 shows these three 
functions, with the value for y = —1 in the left column and the value for y = +1 
in the right column. We can see that all three loss functions discourage prediction
errors, since their values are higher for sign(u) ≠ y than when sign(u) = y.
The loss function for nonlinear least squares classification with the sign function 
(shown in the middle row) assesses a cost of 0 for a correct prediction and 4 for 
an incorrect prediction. The loss function for nonlinear least squares classification 
with the sigmoid function (shown in the bottom row) is a smooth approximation 
of this. 
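The three loss functions are simple enough to evaluate directly. The Python sketch below is illustrative only: the function names are made up, the sigmoid (18.16) is evaluated as `math.tanh`, and sign(0) is assumed to be +1.

```python
import math

def loss_linear(u, y):
    """Loss for linear least squares classification."""
    return (u - y) ** 2

def loss_sign(u, y):
    """Loss with the sign function: 0 for a correct prediction, 4 for an incorrect one."""
    return ((1.0 if u >= 0 else -1.0) - y) ** 2

def loss_sigmoid(u, y):
    """Loss with the sigmoid function; tanh(u) equals the sigmoid (18.16)."""
    return (math.tanh(u) - y) ** 2

# Evaluate the three losses for outcome y = +1 at a few continuous predictions u.
for u in (-3.0, -0.5, 0.5, 3.0):
    print(u, loss_linear(u, 1), loss_sign(u, 1), round(loss_sigmoid(u, 1), 3))
```

For u = 3 and y = +1 the prediction has the correct sign, yet the linear least squares loss is 4, the same value the sign loss assigns to an error, while the sigmoid loss is close to zero. This is one way to see why the linear least squares objective is only a surrogate.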
Figure 18.16 The loss functions $\ell(u, y)$ for linear least squares classification
(top), nonlinear least squares classification with the sign function (middle),
and nonlinear least squares classification with the sigmoid function (bottom).
The left column shows $\ell(u, -1)$ and the right column shows $\ell(u, +1)$.


18.5.1 Handwritten digit classification


We apply nonlinear least squares classification on the MNIST set of handwritten
digits used in chapter 14. We first consider the Boolean problem of recognizing the
digit zero. We use linear features, i.e.,

$$\tilde f(x) = x^T \beta + v,$$

where x is the 493-vector of pixel intensities. To determine the parameters v and
$\beta$ we solve the nonlinear least squares problem

$$\mbox{minimize} \quad \sum_{i=1}^{N} \left( \phi((x^{(i)})^T \beta + v) - y^{(i)} \right)^2 + \lambda \|\beta\|^2, \qquad (18.18)$$

where $\phi$ is the sigmoid function (18.16) and $\lambda$ is a positive regularization parameter.
(This $\lambda$ is the regularization parameter in the classification problem; it has no
relation to the trust parameter $\lambda^{(k)}$ in the iterates of the Levenberg-Marquardt
algorithm.)

Figure 18.17 shows the classification error on the training and test sets as a
function of the regularization parameter $\lambda$. For $\lambda = 100$, the classification errors
on the training and test sets are about 0.7%. This is less than half the 1.6%
error of the Boolean least squares classifier that used the same features, discussed
in chapter 14. This improvement in performance, by more than a factor of two,
comes from minimizing an objective that is closer to what we want (i.e., the number
of prediction errors on the training set) than the surrogate linear least squares
objective. The confusion matrices for the training set and test set are given in
table 18.1. Figure 18.18 shows the distribution of the values of $\tilde f(x^{(i)})$ for the two
classes of the data set.
                 Prediction                              Prediction
Outcome   ŷ = +1   ŷ = −1   Total      Outcome   ŷ = +1   ŷ = −1   Total
y = +1      5627      296    5923      y = +1       945       35     980
y = −1       148    53929   54077      y = −1        40     8980    9020
All         5775    54225   60000      All          985     9015   10000


Table 18.1 Confusion matrices for a Boolean classifier to recognize the digit
zero. The table on the left is for the training set. The table on the right is
for the test set.
Figure 18.18 The distribution of the values of $\tilde f(x^{(i)})$ used in the Boolean
classifier (14.1) for recognizing the digit zero. The function $\tilde f$ was computed
by solving the nonlinear least squares problem (18.17).


Figure 18.19 Training and test error versus Levenberg-Marquardt iteration
for $\lambda = 100$.


Convergence of Levenberg—Marquardt algorithm. The Levenberg-Marquardt 
algorithm is used to compute the parameters in the nonlinear least squares classifier. 
In this example the algorithm takes several tens of iterations to converge, i.e., until 
the stopping criterion for the nonlinear least squares problem is satisfied. But in 
this application we are more interested in the performance of the classifier, and 
not minimizing the objective of the nonlinear least squares problem. Figure 18.19 
shows the classification error of the classifier (on the training and test data sets)
with parameter $\theta^{(k)}$, the kth iterate of the Levenberg-Marquardt algorithm. We
can see that the classification errors reach their final values of 0.7% after just a few 
iterations. This phenomenon is very typical in nonlinear data fitting problems. Well 
before convergence, the Levenberg—Marquardt algorithm finds model parameters 
that are just as good (as judged by test error) as the parameters obtained when 
the algorithm converges. 


Feature engineering. After adding the 5000 random features used in chapter 14,
we obtain the training and test classification errors shown in figure 18.20. The
error on the training set is zero for small $\lambda$. For $\lambda = 1000$, the error on the test
set is 0.24%, with the confusion matrix in table 18.2. The distribution of $\tilde f(x^{(i)})$
on the training set in figure 18.21 shows why the training error is zero.

Figure 18.22 shows the classification errors versus Levenberg-Marquardt
iteration, if we start the Levenberg-Marquardt algorithm with $\beta = 0$, $v = 0$. (This
implies that the values computed in the first iteration are the coefficients of the
linear least squares classifier.) The error on the training set is exactly zero at
iteration 5. The error on the test set is almost equal to its final value after one iteration.
Figure 18.20 Boolean classification error in percent versus $\lambda$, after adding
5000 random features.

                 Prediction
Outcome    ŷ = +1   ŷ = −1   Total
y = +1        967       13     980
y = −1         11     9009    9020
All           978     9022   10000

Table 18.2 Confusion matrix on the test set for the Boolean classifier to
recognize the digit zero after addition of 5000 new features.

Figure 18.21 The distribution of the values of f(x) used in the Boolean
classifier (14.1) for recognizing the digit zero, after addition of 5000 new
features. The function f was computed by solving the nonlinear least squares
problem (18.17).

Figure 18.22 Training and test error versus Levenberg-Marquardt iteration
for $\lambda = 1000$.

Figure 18.23 Multiclass classification error in percent versus $\lambda$.


Multi-class classifier. Next we apply the nonlinear least squares method to the
multi-class classification of recognizing the ten digits in the MNIST data set. For
each digit $k$, we compute a Boolean classifier $\tilde f_k(x) = x^T \beta_k + v_k$ by solving a
regularized nonlinear least squares problem (18.18). The same value of $\lambda$ is used
in the ten nonlinear least squares problems. The Boolean classifiers are combined
into a multi-class classifier

$$ \hat f(x) = \mathop{\rm argmax}_{k=1,\ldots,10} \, (x^T \beta_k + v_k). $$

Figure 18.23 shows the classification errors versus $\lambda$. The test set confusion matrix
(for $\lambda = 1$) is given in table 18.3. The classification error on the test set is 7.6%,
down from the 13.9% error we obtained for the same set of features with the least
squares method of chapter 14.
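
As a small illustration of the argmax rule above, the following Python sketch combines per-class affine scores into a multi-class prediction. The function name `multiclass_predict`, the array layout (one column of `B` per class), and the toy numbers are assumptions made here for illustration; they are not the MNIST classifier from the text, and classes are indexed from 0 rather than 1.

```python
import numpy as np

def multiclass_predict(X, B, v):
    """Return, for each row x of X, the index k maximizing x^T beta_k + v_k.

    X : N x n array of feature vectors (one per row)
    B : n x K array whose kth column is beta_k (hypothetical layout)
    v : K-vector of offsets v_k
    """
    scores = X @ B + v                # N x K array of Boolean classifier values
    return np.argmax(scores, axis=1)  # 0-based class index for each row

# Toy data (made up for illustration only).
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])
B = np.eye(3)
labels = multiclass_predict(X, B, np.zeros(3))
shifted = multiclass_predict(X, B, np.array([0.0, 0.0, 5.0]))
```

Increasing one offset $v_k$ (as in `shifted`) biases every prediction toward class $k$, which is why the offsets are fitted together with the coefficient vectors.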


Feature engineering. Figure 18.24 shows the error rates when we add the 5000
randomly generated features. The training and test error rates are now 0.02% and
2%. The test set confusion matrix for $\lambda = 1000$ is given in table 18.4. This classifier
has matched human performance in classifying digits correctly. Further, or more
sophisticated, feature engineering can bring the test performance well below what
humans can achieve.
                                  Prediction
Digit      0     1     2     3     4     5     6     7     8     9   Total
0        964     0     0     2     0     2     5     3     3     1     980
1          0  1112     4     3     0     1     4     1    10     0    1135
2          5     5   934    13     7     3    13    10    38     4    1032
3          3     0    19   926     1    21     2     8    21     9    1010
4          1     2     4     2   917     0     7     1    10    38     982
5         10     2     2    31    10   782    17     7    23     8     892
6          8     3     3     1     5    20   910     1     7     0     958
7          2     6    25     5    11     5     0   947     4    23    1028
8         13    10     4    18    16    27     8     9   865     4     974
9          8     6     0    12    43    11     1    19    23   886    1009
All     1014  1146   995  1013  1010   872   967  1006  1004   973   10000

Table 18.3 Confusion matrix for test set. The error rate is 7.6%.
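
The error rate quoted in the caption can be recovered from the table itself: correct predictions lie on the diagonal, so the error rate is one minus the diagonal sum divided by the total count. The sketch below does this arithmetic in Python with the entries of table 18.3 (the garbled entry in row 4, column 7 is taken to be 1, the value that makes the row and column totals agree); the variable names are our own.

```python
import numpy as np

# Rows: true digit 0..9; columns: predicted digit 0..9 (table 18.3).
C = np.array([
    [964,    0,   0,   2,   0,   2,   5,   3,   3,   1],
    [  0, 1112,   4,   3,   0,   1,   4,   1,  10,   0],
    [  5,    5, 934,  13,   7,   3,  13,  10,  38,   4],
    [  3,    0,  19, 926,   1,  21,   2,   8,  21,   9],
    [  1,    2,   4,   2, 917,   0,   7,   1,  10,  38],
    [ 10,    2,   2,  31,  10, 782,  17,   7,  23,   8],
    [  8,    3,   3,   1,   5,  20, 910,   1,   7,   0],
    [  2,    6,  25,   5,  11,   5,   0, 947,   4,  23],
    [ 13,   10,   4,  18,  16,  27,   8,   9, 865,   4],
    [  8,    6,   0,  12,  43,  11,   1,  19,  23, 886],
])

total = C.sum()                   # number of test examples
correct = np.trace(C)             # correctly classified examples
error_rate = 1 - correct / total  # fraction misclassified
row_totals = C.sum(axis=1)        # examples per true digit
print(f"error rate: {100 * error_rate:.2f}%")
```

The printed value rounds to the 7.6% stated in the caption.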
Figure 18.24 Multiclass classification error in percent versus $\lambda$ after adding
5000 random features.

                                  Prediction
Digit      0     1     2     3     4     5     6     7     8     9   Total
0        972     1     1     1     0     1     1     1     2     0     980
1          0  1124     2     2     0     0     3     1     3     0    1135
2          5     0  1006     1     3     0     2     6     9     0    1032
3          0     0     3   986     0     5     0     3     7     6    1010
4          0     0     4     1   966     0     4     1     0     6     982
5          2     0     2     5     2   875     5     0     1     0     892
6          7     2     0     1     3     2   941     0     2     0     958
7          1     7     6     1     2     0     0  1003     3     5    1028
8          3     0     0     4     4     5     3     4   949     2     974
9          2     5     0     5     6     4     1     6     2   978    1009
All      992  1139  1024  1007   986   892   960  1025   978   997   10000

Table 18.4 Confusion matrix for test set after adding 5000 features. The
error rate is 2.0%.
Exercises

18.1 Lambert W-function. The Lambert W-function, denoted $W : [0,\infty) \to \mathbf{R}$, is defined as
$W(u) = x$, where $x$ is the unique number $x \geq 0$ for which $xe^x = u$. (The notation just
means that we restrict the argument $x$ to be nonnegative.) The Lambert function arises in
a variety of applications, and is named after the mathematician Johann Heinrich Lambert.
There is no analytical formula for $W(u)$; it must be computed numerically. In this exercise
you will develop a solver to compute $W(u)$, given a nonnegative number $u$, using the
Levenberg-Marquardt algorithm, by minimizing $f(x)^2$ over $x$, where $f(x) = xe^x - u$.

(a) Give the Levenberg-Marquardt update (18.13) for $f$.

(b) Implement the Levenberg-Marquardt algorithm for minimizing $f(x)^2$. You can start
with $x^{(1)} = 1$ and $\lambda^{(1)} = 1$ (but it should work with other initializations). You can
stop the algorithm when $|f(x)|$ is small, say, less than $10^{-6}$.
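
One possible shape for such a solver is sketched below in Python; it is an illustration, not a reference solution. For a scalar variable the Levenberg-Marquardt step reduces to $x - f'(x)f(x)/(\lambda + f'(x)^2)$ with $f'(x) = (1+x)e^x$. The function name `lambert_w`, the iteration cap, and the factors 0.8 and 2 used to decrease or increase $\lambda$ are assumptions made here.

```python
import math

def lambert_w(u, x=1.0, lam=1.0, tol=1e-6, max_iter=1000):
    """Approximate W(u) for u >= 0 by minimizing f(x)^2, f(x) = x e^x - u."""
    f = lambda t: t * math.exp(t) - u
    df = lambda t: (1.0 + t) * math.exp(t)   # derivative of f
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:                    # stopping criterion |f(x)| small
            break
        d = df(x)
        x_new = x - d * fx / (lam + d * d)   # scalar Levenberg-Marquardt step
        if f(x_new) ** 2 < fx ** 2:
            x, lam = x_new, 0.8 * lam        # accept step, trust the model more
        else:
            lam = 2.0 * lam                  # reject step, regularize more
    return x

w1 = lambert_w(1.0)
w10 = lambert_w(10.0)
```

A quick sanity check is to confirm that `w1 * math.exp(w1)` is close to 1 and `w10 * math.exp(w10)` is close to 10.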


18.2 Internal rate of return. Let the $n$-vector $c$ denote a cash flow over $n$ time periods, with
positive entries meaning cash received and negative entries meaning payments. We assume
that its NPV (net present value; see page 22) with interest rate $r \geq 0$ is given by

$$ N(r) = \sum_{i=1}^{n} \frac{c_i}{(1+r)^{i-1}}. $$

The internal rate of return (IRR) of the cash flow is defined as the smallest positive value
of $r$ for which $N(r) = 0$. The Levenberg-Marquardt algorithm can be used to compute
the IRR of a given cash flow sequence, by minimizing $N(r)^2$ over $r$.

(a) Work out a specific formula for the Levenberg-Marquardt update for $r$, i.e., (18.13).

(b) Implement the Levenberg-Marquardt algorithm to find the IRR of the cash flow
sequence
$$ c = (-\mathbf{1}_3, \; 0.3\,\mathbf{1}_5, \; 0.6\,\mathbf{1}_6), $$
where the subscripts give the dimensions. (This corresponds to three periods in
which you make investments, which pay off at one rate for 5 periods, and a higher
rate for the next 6 periods.) You can initialize with $r^{(0)} = 0$, and stop when $N(r^{(k)})^2$
is small. Plot $N(r^{(k)})^2$ versus $k$.


18.3 A common form for the residual. In many nonlinear least squares problems the residual
function $f : \mathbf{R}^n \to \mathbf{R}^m$ has the specific form

$$ f_i(x) = \phi_i(a_i^T x - b_i), \quad i = 1, \ldots, m, $$

where $a_i$ is an $n$-vector, $b_i$ is a scalar, and $\phi_i : \mathbf{R} \to \mathbf{R}$ is a scalar-valued function of a
scalar. In other words, $f_i(x)$ is a scalar function of an affine function of $x$. In this case
the objective of the nonlinear least squares problem has the form

$$ \|f(x)\|^2 = \sum_{i=1}^{m} \left( \phi_i(a_i^T x - b_i) \right)^2. $$

We define the $m \times n$ matrix $A$ to have rows $a_1^T, \ldots, a_m^T$, and the $m$-vector $b$ to have entries
$b_1, \ldots, b_m$. Note that if the functions $\phi_i$ are the identity function, i.e., $\phi_i(u) = u$ for all $u$,
then the objective becomes $\|Ax - b\|^2$, and in this case the nonlinear least squares problem
reduces to the linear least squares problem. Show that the derivative matrix $Df(x)$ has
the form
$$ Df(x) = \mathbf{diag}(d) A, $$
where $d_i = \phi_i'(r_i)$ for $i = 1, \ldots, m$, with $r = Ax - b$.

Remark. This means that in each iteration of the Gauss-Newton method we solve a
weighted least squares problem (and in the Levenberg-Marquardt algorithm, a regularized
weighted least squares problem); see exercise 12.4. The weights change in each iteration.
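
The claimed form of the derivative matrix can be checked numerically before proving it. The following Python sketch, which is not a proof, compares $\mathbf{diag}(d)A$ with a central finite-difference approximation of $Df(x)$. The choice $\phi_i = \tanh$ for every $i$, the random data, and all names are assumptions made purely for this check.

```python
import numpy as np

def residual(x, A, b, phi):
    """f(x) with entries phi(a_i^T x - b_i); phi is applied elementwise."""
    return phi(A @ x - b)

def jacobian(x, A, b, dphi):
    """Df(x) = diag(d) A with d_i = phi'(r_i), r = A x - b."""
    d = dphi(A @ x - b)
    return d[:, None] * A        # scales row i of A by d_i, i.e. diag(d) @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
phi = np.tanh                                # assumed example of phi_i
dphi = lambda u: 1.0 - np.tanh(u) ** 2       # its derivative

J = jacobian(x, A, b, dphi)

# Central finite differences, one column of Df(x) per coordinate direction.
eps = 1e-6
J_fd = np.column_stack([
    (residual(x + eps * e, A, b, phi) - residual(x - eps * e, A, b, phi)) / (2 * eps)
    for e in np.eye(3)
])
max_err = np.max(np.abs(J - J_fd))
```

Row scaling with `d[:, None] * A` avoids forming the $m \times m$ diagonal matrix explicitly, which matters when $m$ is large.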
18.4 Fitting an exponential to data. Use the Levenberg-Marquardt algorithm to fit an
exponential function of the form $\hat f(x; \theta) = \theta_1 e^{\theta_2 x}$ to the data

$$ 0, 1, \ldots, 5, \qquad 5.2, 4.5, 2.7, 2.5, 2.1, 1.9. $$

(The first list gives $x^{(i)}$; the second list gives $y^{(i)}$.) Plot your model $\hat f(x; \hat\theta)$ versus $x$, along
with the data points.


18.5 Mechanical equilibrium. A mass $m$, at position given by the 2-vector $x$, is subject to three
forces acting on it. The first force $F^{\rm grav}$ is gravity, which has value $F^{\rm grav} = -mg(0,1)$,
where $g = 9.8$ is the acceleration of gravity. (This force points straight down, with a force
that does not depend on the position $x$.) The mass is attached to two cables, whose other
ends are anchored at (2-vector) locations $a_1$ and $a_2$. The force $F_i$ applied to the mass by
cable $i$ is given by

$$ F_i = T_i (a_i - x) / \|a_i - x\|, $$

where $T_i$ is the cable tension. (This means that each cable applies a force on the mass
that points from the mass to the cable anchor, with magnitude given by the tension $T_i$.)
The cable tensions are given by

$$ T_i = k \, \frac{\max\{\|a_i - x\| - L_i, \, 0\}}{L_i}, $$

where $k$ is a positive constant, and $L_i$ is the natural or unloaded length of cable $i$ (also
positive). In words: The tension is proportional to the fractional stretch of the cable,
above its natural length. (The max appearing in this formula means the tension is not
a differentiable function of the position, when $\|a_i - x\| = L_i$, but we will simply ignore
this.) The mass is in equilibrium at position $x$ if the three forces acting on it sum to zero,

$$ F^{\rm grav} + F_1 + F_2 = 0. $$

We refer to the left-hand side as the residual force. It is a function of mass position $x$,
and we write it as $f(x)$.

Compute an equilibrium position for
$$ a_1 = (3,2), \quad a_2 = (-1,1), \quad L_1 = 3, \quad L_2 = 2, \quad m = 1, \quad k = 100, $$
by applying the Levenberg-Marquardt algorithm to the residual force $f(x)$. Use $x^{(1)} =
(0,0)$ as starting point. (Note that it is important to start at a point where $T_1 > 0$ and
$T_2 > 0$, because otherwise the derivative matrix $Df(x^{(1)})$ is zero, and the
Levenberg-Marquardt update gives $x^{(2)} = x^{(1)}$.) Plot the components of the mass position and the
residual force versus iterations.
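
As a starting point for this exercise, the residual force can be coded directly from its definition. The Python sketch below evaluates the sum of gravity and the two cable forces at a given position; the function name `residual_force` and its argument layout are assumptions made here, and the Levenberg-Marquardt iteration itself is left to the reader.

```python
import numpy as np

def residual_force(x, anchors, lengths, m=1.0, k=100.0, g=9.8):
    """Sum of gravity and cable forces acting on the mass at position x."""
    x = np.asarray(x, dtype=float)
    force = np.array([0.0, -m * g])              # gravity, pointing straight down
    for a, L in zip(anchors, lengths):
        delta = np.asarray(a, dtype=float) - x   # vector from mass to anchor
        dist = np.linalg.norm(delta)
        tension = k * max(dist - L, 0.0) / L     # zero when the cable is slack
        if tension > 0.0:
            force = force + tension * delta / dist
    return force

# Data from the exercise statement.
anchors = [(3.0, 2.0), (-1.0, 1.0)]
lengths = [3.0, 2.0]
f0 = residual_force((0.0, 0.0), anchors, lengths)
```

Evaluating the function at a trial position is a convenient way to see which cables are taut there before running the solver.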


18.6 Fitting a simple neural network model. A neural network is a widely used model of the
form $\hat y = \hat f(x; \theta)$, where the $n$-vector $x$ is the feature vector and the $p$-vector $\theta$ is the
model parameter. In a neural network model, the function $\hat f$ is not an affine function of
the parameter vector $\theta$. In this exercise we consider a very simple neural network, with
two layers, three internal nodes, and two inputs (i.e., $n = 2$). This model has $p = 13$
parameters, and is given by

$$ \hat f(x; \theta) = \theta_1 \phi(\theta_2 x_1 + \theta_3 x_2 + \theta_4) + \theta_5 \phi(\theta_6 x_1 + \theta_7 x_2 + \theta_8)
   + \theta_9 \phi(\theta_{10} x_1 + \theta_{11} x_2 + \theta_{12}) + \theta_{13}, $$

where $\phi : \mathbf{R} \to \mathbf{R}$ is the sigmoid function defined in (18.16). This function is shown as
a signal flow graph in figure 18.25. In this graph each edge from an input to an internal
node, or from an internal node to the output node, corresponds to multiplication by one
of the parameters. At each node (shown as the small filled circles) the incoming values
and the constant offset are added together, then passed through the sigmoid function, to
become the outgoing edge value.

Figure 18.25 Signal flow graph of a simple neural network.

Fitting such a model to a data set consisting of the $n$-vectors $x^{(1)}, \ldots, x^{(N)}$ and the
associated scalar outcomes $y^{(1)}, \ldots, y^{(N)}$ by minimizing the sum of the squares of the
residuals is a nonlinear least squares problem with objective (18.4).

(a) Derive an expression for $\nabla_\theta \hat f(x; \theta)$. Your expression can use $\phi$ and $\phi'$, the sigmoid
function and its derivative. (You do not need to express these in terms of exponentials.)

(b) Derive an expression for the derivative matrix $Dr(\theta)$, where $r : \mathbf{R}^p \to \mathbf{R}^N$ is the
vector of model fitting residuals,

$$ r(\theta)_i = \hat f(x^{(i)}; \theta) - y^{(i)}, \quad i = 1, \ldots, N. $$

Your expression can use the gradient found in part (a).

(c) Try fitting this neural network to the function $g(x_1, x_2) = x_1 x_2$. First generate
$N = 200$ random points $x^{(i)}$ and take $y^{(i)} = (x^{(i)})_1 (x^{(i)})_2$ for $i = 1, \ldots, 200$. Use
the Levenberg-Marquardt algorithm to try to minimize

$$ f(\theta) = \|r(\theta)\|^2 + \gamma \|\theta\|^2 $$

with $\gamma = 10^{-5}$. Plot the value of $f$ and the norm of its gradient versus iteration.
Report the RMS fitting error achieved by the neural network model. Experiment
with choosing different starting points to see the effect on the final model found.

(d) Fit the same data set with a (linear) regression model $\hat f^{\rm lin}(x; \beta, v) = x^T \beta + v$ and
report the RMS fitting error achieved. (You can add regularization in your fitting,
but it won't improve the results.) Compare the RMS fitting error with the neural
network model RMS fitting error from part (c).

Remarks. Neural networks used in practice employ many more regressors, layers, and
internal nodes. Specialized methods and software are used to minimize the fitting objective,
and evaluate the required gradients and derivatives.
Figure 18.26 Two-link robot manipulator in a plane.

18.7 Robot manipulator. Figure 18.26 shows a two-link robot manipulator in a plane. The
robot manipulator endpoint is at the position

$$ p = L_1 \begin{bmatrix} \cos\theta_1 \\ \sin\theta_1 \end{bmatrix}
     + L_2 \begin{bmatrix} \cos(\theta_1 + \theta_2) \\ \sin(\theta_1 + \theta_2) \end{bmatrix}, $$

where $L_1$ and $L_2$ are the lengths of the first and second links, $\theta_1$ is the first joint angle, and
$\theta_2$ is the second joint angle. We will assume that $L_2 < L_1$, i.e., the second link is shorter than
the first. We are given a desired endpoint position $p^{\rm des}$, and seek joint angles $\theta = (\theta_1, \theta_2)$
for which $p = p^{\rm des}$.

This problem can be solved analytically. We find $\theta_2$ using the equation

$$ \|p^{\rm des}\|^2 = (L_1 + L_2 \cos\theta_2)^2 + (L_2 \sin\theta_2)^2
                     = L_1^2 + L_2^2 + 2 L_1 L_2 \cos\theta_2. $$

When $L_1 - L_2 < \|p^{\rm des}\| < L_1 + L_2$, there are two choices of $\theta_2$ (one positive and one
negative). For each solution $\theta_2$ we can then find $\theta_1$ using

$$ p^{\rm des} = L_1 \begin{bmatrix} \cos\theta_1 \\ \sin\theta_1 \end{bmatrix}
     + L_2 \begin{bmatrix} \cos(\theta_1 + \theta_2) \\ \sin(\theta_1 + \theta_2) \end{bmatrix}
   = \begin{bmatrix} L_1 + L_2\cos\theta_2 & -L_2 \sin\theta_2 \\ L_2 \sin\theta_2 & L_1 + L_2 \cos\theta_2 \end{bmatrix}
     \begin{bmatrix} \cos\theta_1 \\ \sin\theta_1 \end{bmatrix}. $$
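
The analytical solution just described is useful as a reference when testing an iterative solver. The Python sketch below follows the two steps in the text: solve the scalar equation for $\cos\theta_2$, then solve the $2 \times 2$ linear system for $(\cos\theta_1, \sin\theta_1)$. The function names and the decision to raise an error for unreachable endpoints are assumptions made for this illustration.

```python
import numpy as np

def endpoint(theta, L1, L2):
    """Endpoint position p for joint angles theta = (theta1, theta2)."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def joint_angles(p_des, L1, L2):
    """Return the (up to two) analytical solutions (theta1, theta2)."""
    p_des = np.asarray(p_des, dtype=float)
    # From ||p_des||^2 = L1^2 + L2^2 + 2 L1 L2 cos(theta2).
    c2 = (p_des @ p_des - L1**2 - L2**2) / (2 * L1 * L2)
    if abs(c2) > 1:
        raise ValueError("desired endpoint is not reachable")
    solutions = []
    for t2 in (np.arccos(c2), -np.arccos(c2)):
        M = np.array([[L1 + L2 * np.cos(t2), -L2 * np.sin(t2)],
                      [L2 * np.sin(t2),       L1 + L2 * np.cos(t2)]])
        c1, s1 = np.linalg.solve(M, p_des)   # (cos theta1, sin theta1)
        solutions.append((np.arctan2(s1, c1), t2))
    return solutions

sols = joint_angles((1.0, 0.5), L1=2.0, L2=1.0)
errors = [np.linalg.norm(endpoint(s, 2.0, 1.0) - np.array([1.0, 0.5])) for s in sols]
```

For an endpoint whose norm exceeds $L_1 + L_2$, `joint_angles` raises `ValueError`, since no value of $\theta_2$ satisfies the first equation.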


In this exercise you will use the Levenberg-Marquardt algorithm to find joint angles, by
minimizing $\|p - p^{\rm des}\|^2$.

(a) Identify the function $f(\theta)$ in the nonlinear least squares problem, and give its
derivative $Df(\theta)$.

(b) Implement the Levenberg-Marquardt algorithm to solve the nonlinear least squares
problem. Try your implementation on a robot with $L_1 = 2$, $L_2 = 1$, and the desired
endpoints
$$ (1.0, 0.5), \qquad (-2.0, 1.0), \qquad (-0.2, 3.1). $$
For each endpoint, plot the cost function $\|f(\theta^{(k)})\|^2$ versus iteration number $k$.

Note that the norm of the last endpoint exceeds $L_1 + L_2 = 3$, so there are no joint
angles for which $p = p^{\rm des}$. Explain the angles your algorithm finds in this case.
Figure 18.27 Ellipse with center $(c_1, c_2)$, and radii $r + \delta$ and $r - \delta$. The
largest semi-axis makes an angle $\alpha$ with respect to horizontal.

Figure 18.28 Ellipse fit to 50 points in a plane.

18.8 Fitting an ellipse to points in a plane. An ellipse in a plane can be described as the set
of points
$$ \hat f(t; \theta) = \begin{bmatrix} c_1 + r\cos(\alpha + t) + \delta \cos(\alpha - t) \\
                                        c_2 + r\sin(\alpha + t) + \delta \sin(\alpha - t) \end{bmatrix}, $$
where $t$ ranges from 0 to $2\pi$. The vector $\theta = (c_1, c_2, r, \delta, \alpha)$ contains five parameters, with
geometrical meanings illustrated in figure 18.27. We consider the problem of fitting an
ellipse to $N$ points $x^{(1)}, \ldots, x^{(N)}$ in a plane, as shown in figure 18.28. The circles show
the $N$ points. The short lines connect each point to the nearest point on the ellipse. We
will fit the ellipse by minimizing the sum of the squared distances of the $N$ points to the
ellipse.

(a) The squared distance of a data point $x^{(i)}$ to the ellipse is the minimum of
$\|\hat f(t^{(i)}; \theta) - x^{(i)}\|^2$ over the scalar $t^{(i)}$. Minimizing the sum of the squared distances of the data
points $x^{(1)}, \ldots, x^{(N)}$ to the ellipse is therefore equivalent to minimizing

$$ \sum_{i=1}^{N} \|\hat f(t^{(i)}; \theta) - x^{(i)}\|^2 $$

over $t^{(1)}, \ldots, t^{(N)}$ and $\theta$. Formulate this as a nonlinear least squares problem. Give
expressions for the derivatives of the residuals.

(b) Use the Levenberg-Marquardt algorithm to fit an ellipse to the 10 points:
(0.5, 1.5), (−0.3, 0.6), (1.0, 1.8), (−0.4, 0.2), (0.2, 1.3),
(0.7, 0.1), (2.3, 0.8), (1.4, 0.5), (0.0, 0.2), (2.4, 1.7).

To select a starting point, you can choose parameters $\theta$ that describe a circle with
radius one and center at the average of the data points, and initial values of $t^{(i)}$ that
minimize the objective function for those initial values of $\theta$.
Chapter 19

Constrained nonlinear least squares

In this chapter we consider an extension of the nonlinear least squares problem 
that includes nonlinear constraints. Like the problem of solving a set of nonlinear 
equations, or finding a least squares approximate solution to a set of nonlinear 
equations, the constrained nonlinear least squares problem is in general hard to 
solve exactly. We describe a heuristic algorithm that often works well in practice. 


19.1 Constrained nonlinear least squares

In this section we consider an extension of the nonlinear least squares problem (18.2)
that includes equality constraints:

$$ \begin{array}{ll} \mbox{minimize} & \|f(x)\|^2 \\ \mbox{subject to} & g(x) = 0, \end{array} \qquad (19.1) $$

where the $n$-vector $x$ is the variable to be found. Here $f(x)$ is an $m$-vector, and
$g(x)$ is a $p$-vector. We sometimes write out the components of $f(x)$ and $g(x)$, to
express the problem as

$$ \begin{array}{ll} \mbox{minimize} & f_1(x)^2 + \cdots + f_m(x)^2 \\ \mbox{subject to} & g_i(x) = 0, \quad i = 1, \ldots, p. \end{array} $$

We refer to $f_i(x)$ as the $i$th (scalar) residual, and $g_i(x) = 0$ as the $i$th (scalar)
equality constraint. When the functions $f$ and $g$ are affine, the equality constrained
nonlinear least squares problem (19.1) reduces to the (linear) least squares problem
with equality constraints from chapter 16.

We say that a point $x$ is feasible for the problem (19.1) if it satisfies $g(x) = 0$.
A point $\hat x$ is a solution of the problem (19.1) if it is feasible and has the smallest
objective among all feasible points, i.e., if whenever $g(x) = 0$, we have $\|f(x)\|^2 \geq \|f(\hat x)\|^2$.
Like the nonlinear least squares problem, or solving a set of nonlinear equations,
the constrained nonlinear least squares problem is in general hard to solve exactly.
But the Levenberg-Marquardt algorithm for solving the (unconstrained) nonlinear
least squares problem (18.2) can be leveraged to handle the problem with equality
constraints. We will describe a basic algorithm below, the penalty algorithm, and
a variation on it that works much better in practice, the augmented Lagrangian
algorithm. These algorithms are heuristics for (approximately) solving the nonlinear
least squares problem (19.1).

Linear equality constraints. One special case of the constrained nonlinear least
squares problem (19.1) is when the constraint function $g$ is affine, in which case
the constraints $g(x) = 0$ can be written $Cx = d$ for some $p \times n$ matrix $C$ and
a $p$-vector $d$. In this case the problem (19.1) is called a nonlinear least squares
problem with linear equality constraints. It can be (approximately) solved by the
Levenberg-Marquardt algorithm described in chapter 18, by simply adding the
linear equality constraints to the linear least squares problem that is solved in
step 2. The more challenging problem is the case when $g$ is not affine.


19.1.1 Optimality condition

Using Lagrange multipliers (see §C.3) we can derive a condition that any solution
of the constrained nonlinear least squares problem (19.1) must satisfy. The
Lagrangian for the problem (19.1) is

$$ L(x, z) = \|f(x)\|^2 + z_1 g_1(x) + \cdots + z_p g_p(x) = \|f(x)\|^2 + g(x)^T z, \qquad (19.2) $$

where the $p$-vector $z$ is the vector of Lagrange multipliers. The method of Lagrange
multipliers tells us that for any solution $\hat x$ of (19.1), there is a set of Lagrange
multipliers $\hat z$ that satisfy

$$ \frac{\partial L}{\partial x_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, n, \qquad
   \frac{\partial L}{\partial z_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, p $$

(provided the rows of $Dg(\hat x)$ are linearly independent). The $p$-vector $\hat z$ is called an
optimal Lagrange multiplier.

The second set of equations can be written as $g_i(\hat x) = 0$, $i = 1, \ldots, p$, in vector
form

$$ g(\hat x) = 0, \qquad (19.3) $$

i.e., $\hat x$ is feasible, which we already knew. The first set of equations can be written
in vector form as

$$ 2 Df(\hat x)^T f(\hat x) + Dg(\hat x)^T \hat z = 0. \qquad (19.4) $$

This equation is the extension of the condition (18.3) for the unconstrained
nonlinear least squares problem (18.2). The equation (19.4), together with (19.3), i.e.,
$\hat x$ is feasible, form the optimality conditions for the problem (19.1).

If $\hat x$ is a solution of the constrained nonlinear least squares problem (19.1), then
it satisfies the optimality condition (19.4) for some Lagrange multiplier vector $\hat z$
(provided the rows of $Dg(\hat x)$ are linearly independent). So $\hat x$ and $\hat z$ satisfy the
optimality conditions.

These optimality conditions are not sufficient, however; there can be choices of
$x$ and $z$ that satisfy them, but $x$ is not a solution of the constrained nonlinear least
squares problem.


19.2 Penalty algorithm

We start with the observation (already made on page 340) that the equality
constrained problem can be thought of as a limit of a bi-objective problem with
objectives $\|f(x)\|^2$ and $\|g(x)\|^2$, as the weight on the second objective increases to
infinity. Let $\mu$ be a positive number, and consider the composite objective

$$ \|f(x)\|^2 + \mu \|g(x)\|^2. \qquad (19.5) $$

This can be (approximately) minimized using the Levenberg-Marquardt algorithm
applied to

$$ \left\| \begin{bmatrix} f(x) \\ \sqrt{\mu}\, g(x) \end{bmatrix} \right\|^2. \qquad (19.6) $$

By minimizing the composite objective (19.5), we do not insist that $g(x)$ is zero,
but we assess a cost or penalty $\mu \|g(x)\|^2$ on the residual. If we solve this for large
enough $\mu$, we should obtain a choice of $x$ for which $g(x)$ is very small, and $\|f(x)\|^2$
is small, i.e., an approximate solution of (19.1). The second term $\mu \|g(x)\|^2$ is a
penalty imposed on choices of $x$ with nonzero $g(x)$.

Minimizing the composite objective (19.5) for an increasing sequence of values
of $\mu$ is known as the penalty algorithm.


Algorithm 19.1 PENALTY ALGORITHM FOR CONSTRAINED NONLINEAR LEAST SQUARES

given differentiable functions $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^p$, and an initial
point $x^{(1)}$. Set $\mu^{(1)} = 1$.

For $k = 1, 2, \ldots, k^{\max}$

1. Solve unconstrained nonlinear least squares problem. Set $x^{(k+1)}$ to be the
(approximate) minimizer of
$$ \|f(x)\|^2 + \mu^{(k)} \|g(x)\|^2 $$
using the Levenberg-Marquardt algorithm, starting from initial point $x^{(k)}$.

2. Update $\mu^{(k)}$: $\mu^{(k+1)} = 2\mu^{(k)}$.

The penalty algorithm is stopped early if $\|g(x^{(k)})\|$ is small enough, i.e., the equality
constraint is almost satisfied.
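
To make the steps of the penalty algorithm concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the inner solver `levenberg_marquardt` is a bare-bones version (the factors 0.8 and 2 for adjusting the regularization parameter, the tolerances, and the iteration caps are choices made here), and the test problem, projecting the point $(1,1)$ onto the unit circle, is a hypothetical example that does not appear in the text.

```python
import numpy as np

def levenberg_marquardt(F, DF, x, lam=1.0, tol=1e-8, max_iter=200):
    """Basic Levenberg-Marquardt iteration for minimizing ||F(x)||^2."""
    for _ in range(max_iter):
        r, J = F(x), DF(x)
        if np.linalg.norm(2 * J.T @ r) < tol:      # gradient of ||F||^2 is small
            break
        step = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), J.T @ r)
        x_new = x - step
        if np.sum(F(x_new) ** 2) < np.sum(r ** 2):
            x, lam = x_new, 0.8 * lam              # accept the step
        else:
            lam = 2.0 * lam                        # reject, regularize more
    return x

def penalty_algorithm(f, Df, g, Dg, x, k_max=40, feas_tol=1e-5):
    """Penalty algorithm: minimize ||f||^2 + mu ||g||^2 for doubling mu."""
    mu = 1.0
    for _ in range(k_max):
        # Stacked residual (19.6) and its derivative matrix.
        F = lambda x, mu=mu: np.concatenate([f(x), np.sqrt(mu) * g(x)])
        DF = lambda x, mu=mu: np.vstack([Df(x), np.sqrt(mu) * Dg(x)])
        x = levenberg_marquardt(F, DF, x)
        if np.linalg.norm(g(x)) < feas_tol:        # constraint almost satisfied
            break
        mu *= 2.0
    return x, mu

# Hypothetical test problem: minimize ||x - a||^2 subject to ||x||^2 = 1.
a = np.array([1.0, 1.0])
f = lambda x: x - a
Df = lambda x: np.eye(2)
g = lambda x: np.array([x @ x - 1.0])
Dg = lambda x: 2 * x.reshape(1, 2)

x_pen, mu_pen = penalty_algorithm(f, Df, g, Dg, np.array([2.0, 0.0]))
```

On this toy problem the solution is $a/\|a\|$, and the constraint violation shrinks only in proportion to $1/\mu$, so the final `mu_pen` is in the tens of thousands before the feasibility tolerance is met. This is the drawback discussed next.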

The penalty algorithm is simple and easy to implement, but has an important
drawback: The parameter $\mu^{(k)}$ rapidly increases with iterations (as it must, to drive
$g(x)$ to zero). When the Levenberg-Marquardt algorithm is used to minimize (19.5)
for very high values of $\mu$, it can take a large number of iterations or simply fail.
The augmented Lagrangian algorithm described below gets around this drawback,
and gives a much more reliable algorithm.

We can connect the penalty algorithm iterates to the optimality condition (19.4).
The iterate $x^{(k+1)}$ (almost) satisfies the optimality condition for minimizing (19.5),

$$ 2 Df(x^{(k+1)})^T f(x^{(k+1)}) + 2\mu^{(k)} Dg(x^{(k+1)})^T g(x^{(k+1)}) = 0. $$

Defining
$$ z^{(k+1)} = 2\mu^{(k)} g(x^{(k+1)}) $$
as our estimate of a suitable Lagrange multiplier in iteration $k+1$, we see that the
optimality condition (19.4) (almost) holds for $x^{(k+1)}$ and $z^{(k+1)}$. (The feasibility
condition $g(x^{(k)}) = 0$ only holds in the limit as $k \to \infty$.)


19.3 Augmented Lagrangian algorithm

The augmented Lagrangian algorithm is a modification of the penalty algorithm
that addresses the difficulty associated with the penalty parameter $\mu^{(k)}$ becoming
very large. It was proposed by Magnus Hestenes and Michael Powell in the 1960s.


Augmented Lagrangian. The augmented Lagrangian for the problem (19.1), with
parameter $\mu > 0$, is defined as

$$ L_\mu(x, z) = L(x, z) + \mu \|g(x)\|^2 = \|f(x)\|^2 + g(x)^T z + \mu \|g(x)\|^2. \qquad (19.7) $$

This is the Lagrangian, augmented with the new term $\mu \|g(x)\|^2$; alternatively, it
can be interpreted as the composite objective function (19.5) used in the penalty
algorithm, with the Lagrange multiplier term $g(x)^T z$ added.

The augmented Lagrangian (19.7) is also the ordinary Lagrangian associated
with the problem
$$ \begin{array}{ll} \mbox{minimize} & \|f(x)\|^2 + \mu \|g(x)\|^2 \\ \mbox{subject to} & g(x) = 0. \end{array} $$

This problem is equivalent to the original constrained nonlinear least squares
problem (19.1): A point $x$ is a solution of one if and only if it is a solution of the other.
(This follows since the term $\mu \|g(x)\|^2$ is zero for any feasible $x$.)


Minimizing the augmented Lagrangian. In the augmented Lagrangian algorithm
we minimize the augmented Lagrangian over the variable $x$ for a sequence of values
of $\mu$ and $z$. We show here how this can be done using the Levenberg-Marquardt
algorithm. We first establish the identity

$$ L_\mu(x, z) = \|f(x)\|^2 + \mu \|g(x) + z/(2\mu)\|^2 - \mu \|z/(2\mu)\|^2. \qquad (19.8) $$

We expand the second term on the right-hand side to get

$$ \begin{array}{rcl}
\mu \|g(x) + z/(2\mu)\|^2
 &=& \mu \|g(x)\|^2 + 2\mu g(x)^T (z/(2\mu)) + \mu \|z/(2\mu)\|^2 \\
 &=& g(x)^T z + \mu \|g(x)\|^2 + \mu \|z/(2\mu)\|^2.
\end{array} $$

Substituting this into the right-hand side of (19.8) verifies the identity.
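
The identity (19.8) is purely algebraic, so it holds for any vectors standing in for $f(x)$, $g(x)$, and $z$, and any $\mu > 0$. The short Python check below evaluates both sides on random data; the dimensions, the seed, and the value of `mu` are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
fx = rng.standard_normal(4)   # stands in for f(x)
gx = rng.standard_normal(2)   # stands in for g(x)
z = rng.standard_normal(2)    # Lagrange multiplier vector
mu = 3.0                      # penalty parameter

# Left-hand side: the augmented Lagrangian as defined in (19.7).
lhs = fx @ fx + gx @ z + mu * (gx @ gx)

# Right-hand side of (19.8): completed square minus a term independent of x.
shift = z / (2 * mu)
rhs = fx @ fx + mu * np.sum((gx + shift) ** 2) - mu * np.sum(shift ** 2)

gap = abs(lhs - rhs)
```

The value of `gap` is at the level of floating-point rounding error.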


When we minimize $L_\mu(x, z)$ over the variable $x$, the term $-\mu \|z/(2\mu)\|^2$ in (19.8)
is a constant (i.e., does not depend on $x$), and does not affect the choice of $x$. It
follows that we can minimize $L_\mu(x, z)$ over $x$ by minimizing the function

$$ \|f(x)\|^2 + \mu \|g(x) + z/(2\mu)\|^2, \qquad (19.9) $$

which in turn can be expressed as

$$ \left\| \begin{bmatrix} f(x) \\ \sqrt{\mu}\, g(x) + z/(2\sqrt{\mu}) \end{bmatrix} \right\|^2. \qquad (19.10) $$

This can be (approximately) minimized using the Levenberg-Marquardt algorithm.


Any minimizer $\tilde x$ of $L_\mu(x, z)$ (or equivalently, (19.9)) satisfies the optimality
condition

$$ \begin{array}{rcl}
0 &=& 2 Df(\tilde x)^T f(\tilde x) + 2\mu Dg(\tilde x)^T (g(\tilde x) + z/(2\mu)) \\
  &=& 2 Df(\tilde x)^T f(\tilde x) + Dg(\tilde x)^T (2\mu g(\tilde x) + z).
\end{array} $$

From this equation we can observe that if $\tilde x$ minimizes the augmented Lagrangian
and is also feasible (i.e., $g(\tilde x) = 0$), then it satisfies the optimality condition (19.4)
with the vector $z$ as the Lagrange multiplier. The bottom equation also suggests
a good choice for updating the Lagrange multiplier vector $z$ if $\tilde x$ is not feasible. In
this case the choice

$$ \tilde z = z + 2\mu g(\tilde x) \qquad (19.11) $$

satisfies the optimality condition (19.4) with $\tilde x$ and $\tilde z$.


The augmented Lagrangian algorithm alternates between minimizing the aug- 
mented Lagrangian (approximately, using the Levenberg-Marquardt algorithm), 
and updating the parameter z (our estimate of a suitable Lagrange multiplier) us- 
ing the suggestion (19.11) above. The penalty parameter u is increased only when 
needed, when ||g(x)|| does not sufficiently decrease. 
Algorithm 19.2 AUGMENTED LAGRANGIAN ALGORITHM

given differentiable functions $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^p$, and an initial
point $x^{(1)}$. Set $z^{(1)} = 0$, $\mu^{(1)} = 1$.

For $k = 1, 2, \ldots, k^{\max}$

1. Solve unconstrained nonlinear least squares problem. Set $x^{(k+1)}$ to be the
(approximate) minimizer of
$$ \|f(x)\|^2 + \mu^{(k)} \|g(x) + z^{(k)}/(2\mu^{(k)})\|^2 $$
using the Levenberg-Marquardt algorithm, starting from initial point $x^{(k)}$.

2. Update $z^{(k)}$.
$$ z^{(k+1)} = z^{(k)} + 2\mu^{(k)} g(x^{(k+1)}). $$

3. Update $\mu^{(k)}$.
$$ \mu^{(k+1)} = \left\{ \begin{array}{ll}
   \mu^{(k)} & \|g(x^{(k+1)})\| < 0.25 \|g(x^{(k)})\| \\
   2\mu^{(k)} & \|g(x^{(k+1)})\| \geq 0.25 \|g(x^{(k)})\|.
   \end{array} \right. $$

The augmented Lagrangian algorithm is stopped early if $g(x^{(k)})$ is very small.
Note that due to our particular choice of how $z^{(k)}$ is updated, the iterate $x^{(k+1)}$
(almost) satisfies the optimality condition (19.4) with $z^{(k+1)}$.

The augmented Lagrangian algorithm is not much more complicated than the
penalty algorithm, but it works much better in practice. In part this is because
the penalty parameter $\mu^{(k)}$ does not need to increase as much as the algorithm
proceeds.
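
The three steps of the augmented Lagrangian algorithm can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the inner solver is a bare-bones Levenberg-Marquardt iteration whose update factors and tolerances are assumptions, and the test problem (projecting the point $(1,1)$ onto the unit circle, with optimal multiplier $\sqrt{2} - 1$) is a hypothetical example chosen because its solution is known in closed form.

```python
import numpy as np

def levenberg_marquardt(F, DF, x, lam=1.0, tol=1e-10, max_iter=200):
    """Basic Levenberg-Marquardt iteration for minimizing ||F(x)||^2."""
    for _ in range(max_iter):
        r, J = F(x), DF(x)
        if np.linalg.norm(2 * J.T @ r) < tol:
            break
        step = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), J.T @ r)
        x_new = x - step
        if np.sum(F(x_new) ** 2) < np.sum(r ** 2):
            x, lam = x_new, 0.8 * lam
        else:
            lam = 2.0 * lam
    return x

def augmented_lagrangian(f, Df, g, Dg, x, k_max=50, feas_tol=1e-6):
    z = np.zeros(len(g(x)))                 # multiplier estimate
    mu = 1.0                                # penalty parameter
    g_norm = np.linalg.norm(g(x))
    for _ in range(k_max):
        # Step 1: minimize (19.9), written as the stacked residual (19.10).
        F = lambda x, mu=mu, z=z: np.concatenate(
            [f(x), np.sqrt(mu) * g(x) + z / (2 * np.sqrt(mu))])
        DF = lambda x, mu=mu: np.vstack([Df(x), np.sqrt(mu) * Dg(x)])
        x = levenberg_marquardt(F, DF, x)
        g_new = np.linalg.norm(g(x))
        z = z + 2 * mu * g(x)               # Step 2: multiplier update (19.11)
        if g_new < feas_tol:                # stop early when nearly feasible
            break
        if g_new >= 0.25 * g_norm:          # Step 3: increase mu only if needed
            mu *= 2.0
        g_norm = g_new
    return x, z, mu

# Hypothetical test problem: minimize ||x - a||^2 subject to ||x||^2 = 1.
a = np.array([1.0, 1.0])
f = lambda x: x - a
Df = lambda x: np.eye(2)
g = lambda x: np.array([x @ x - 1.0])
Dg = lambda x: 2 * x.reshape(1, 2)

x_al, z_al, mu_al = augmented_lagrangian(f, Df, g, Dg, np.array([2.0, 0.0]))
```

On this toy problem the penalty parameter `mu_al` stays small while the constraint is driven to the tolerance, in contrast with a pure penalty approach, where $\mu$ must grow in proportion to the inverse of the required feasibility tolerance.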


Example. We consider an example with two variables and

$$ f(x_1, x_2) = \begin{bmatrix} x_1 + \exp(-x_2) \\ x_1^2 + 2x_2 + 1 \end{bmatrix}, \qquad
   g(x_1, x_2) = x_1 + x_1^3 + x_2 + x_2^2. $$

Figure 19.1 shows the contour lines of the cost function $\|f(x)\|^2$ (solid lines) and
the constraint function $g(x)$ (dashed lines). The point $\hat x = (0,0)$ is optimal with
corresponding Lagrange multiplier $\hat z = -2$. One can verify that $g(\hat x) = 0$ and

$$ 2 Df(\hat x)^T f(\hat x) + Dg(\hat x)^T \hat z
   = 2 \begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix}
   - 2 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 0. $$

The circle at $x = (-0.666, -0.407)$ indicates the position of the unconstrained
minimizer of $\|f(x)\|^2$.
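
The verification mentioned in this example is easy to reproduce numerically. The Python sketch below codes $f$, $g$, and their derivative matrices, then evaluates the optimality residual (19.4) at $\hat x = (0,0)$, $\hat z = -2$, and the gradient of $\|f(x)\|^2$ at the reported unconstrained minimizer. The constraint function is taken to be $g(x_1,x_2) = x_1 + x_1^3 + x_2 + x_2^2$, which is our reading of the example; the check at the origin depends only on $g(0) = 0$ and $Dg(0) = (1, 1)$.

```python
import numpy as np

def f(x):
    return np.array([x[0] + np.exp(-x[1]), x[0] ** 2 + 2 * x[1] + 1])

def Df(x):
    return np.array([[1.0, -np.exp(-x[1])],
                     [2 * x[0], 2.0]])

def g(x):
    return np.array([x[0] + x[0] ** 3 + x[1] + x[1] ** 2])

def Dg(x):
    return np.array([[1 + 3 * x[0] ** 2, 1 + 2 * x[1]]])

x_hat = np.array([0.0, 0.0])
z_hat = np.array([-2.0])

# Optimality condition (19.4) and feasibility (19.3) at the claimed solution.
opt_residual = 2 * Df(x_hat).T @ f(x_hat) + Dg(x_hat).T @ z_hat
feas_residual = g(x_hat)

# Gradient of ||f(x)||^2 at the reported unconstrained minimizer; it is only
# approximately zero because the coordinates are rounded to three digits.
x_unc = np.array([-0.666, -0.407])
grad_unc = 2 * Df(x_unc).T @ f(x_unc)
```

Both residuals at $\hat x$ are exactly zero in floating point, since every quantity involved is a small integer.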

The augmented Lagrangian algorithm is started from the point $x^{(1)} = (0.5, -0.5)$.
Figure 19.2 illustrates the first six iterations. The solid lines are the contour lines
for $L_{\mu^{(k)}}(x, z^{(k)})$, the augmented Lagrangian with the current value of the Lagrange
multiplier. For comparison, we also show in figure 19.3 the first six iterations of
the penalty algorithm, started from the same point. The solid lines are the contour
lines of $\|f(x)\|^2 + \mu^{(k)} \|g(x)\|^2$.

In figure 19.4 we show how the algorithms converge. The horizontal axis is
the cumulative number of Levenberg-Marquardt iterations. Each of these requires
the solution of one linear least squares problem (minimizing (19.10) and (19.6),
respectively). The two lines show the absolute value of the feasibility residual
$|g(x^{(k)})|$, and the norm of the optimality condition residual,

$$ \| 2 Df(x^{(k)})^T f(x^{(k)}) + Dg(x^{(k)})^T z^{(k)} \|. $$

The vertical jumps in the optimality condition norm occur in steps 2 and 3 of the
augmented Lagrangian algorithm, and in step 2 of the penalty algorithm, when the
parameters $\mu$ and $z$ are updated.

Figure 19.5 shows the value of the penalty parameter $\mu$ versus the cumulative
number of Levenberg-Marquardt iterations in the two algorithms.

Figure 19.1 Contour lines of the cost function $\|f(x)\|^2$ (solid line) and the
constraint function $g(x)$ (dashed line) for a nonlinear least squares problem
in two variables with one equality constraint.


19.4 Nonlinear control

A nonlinear dynamical system has the form of an iteration
$$ x_{k+1} = f(x_k, u_k), \quad k = 1, 2, \ldots, N, $$
where the $n$-vector $x_k$ is the state, and the $m$-vector $u_k$ is the input or control, at
time period $k$. The function $f : \mathbf{R}^{n+m} \to \mathbf{R}^n$ specifies what the next state is, as a
function of the current state and the current input. When $f$ is an affine function,
this reduces to a linear dynamical system.

In nonlinear control, the goal is to choose the inputs $u_1, \ldots, u_{N-1}$ to achieve
some goal for the state and input trajectories. In many problems the initial state
$x_1$ is given, and the final state $x_N$ is specified. Subject to these constraints, we may
wish the control inputs to be small and smooth, which suggests that we minimize

$$ \sum_{k=1}^{N} \|u_k\|^2 + \gamma \sum_{k=1}^{N-1} \|u_{k+1} - u_k\|^2, $$

where $\gamma > 0$ is a parameter used to trade off input size and smoothness. (In many
nonlinear control problems the objective also involves the state trajectory.)

We can formulate the nonlinear control problem, with a norm squared objective
that involves the state and input, as a large constrained nonlinear least squares
problem, and then solve it using the augmented Lagrangian algorithm. We illustrate
this with a specific example.

Figure 19.2 First six iterations of the augmented Lagrangian algorithm.
[Panel titles recoverable from the figure: $\mu^{(1)} = 1$, $z^{(1)} = 0$; $\mu^{(2)} = 2$, $z^{(2)} = -0.893$;
$\mu^{(3)} = 4$, $z^{(3)} = -1.976$.]

Figure 19.3 First six iterations of the penalty algorithm.

Figure 19.4 Feasibility and optimality condition errors versus the cumulative
number of Levenberg-Marquardt iterations in the augmented Lagrangian
algorithm (top) and the penalty algorithm (bottom).

Figure 19.5 Penalty parameter $\mu$ versus cumulative number of
Levenberg-Marquardt iterations in the augmented Lagrangian algorithm and the
penalty algorithm.


Control of a car. Consider a car with position p = (p1, p2) and orientation (an- 
gle) 6. The car has wheelbase (length) L, steering angle ¢, and speed s (which can 
be negative, meaning the car moves in reverse). This is illustrated in figure 19.6. 

The wheelbase L is a known constant; all of the other quantities p, 0, ¢, and s 
are functions of time. The dynamics of the car motion are given by the differential 
equations 


i (t) = s(t)cos@(t), 
da (t) = s(t)sin A(t), 
dé 


— 
oH 

~~" 
II 


a (s(t)/L) tan ¢(t). 
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Figure 19.6 Simple model of a car. 


Here we assume that the steering angle is always less than 90°, so the tangent 
term in the last equation makes sense. The first two equations state that the car is 
moving in the direction 0(t) (its orientation) at speed s(t). The last equation gives 
the change in orientation as a function of the car speed and the steering angle. For 
a fixed steering angle and speed, the car moves in a circle. 

We can control the speed $s$ and the steering angle $\phi$; the goal is to move the car
over some time period from a given initial position and orientation to a specified 
final position and orientation. 

We now discretize the equations in time. We take a small time interval $h$, and
obtain the approximations

$$
\begin{aligned}
p_1(t+h) &\approx p_1(t) + h s(t) \cos\theta(t), \\
p_2(t+h) &\approx p_2(t) + h s(t) \sin\theta(t), \\
\theta(t+h) &\approx \theta(t) + h (s(t)/L) \tan\phi(t).
\end{aligned}
$$

We will use these approximations to derive nonlinear state equations for the car
motion, with state $x_k = (p_1(kh), p_2(kh), \theta(kh))$ and input $u_k = (s(kh), \phi(kh))$. We
have

$$
x_{k+1} = f(x_k, u_k),
$$

with

$$
f(x_k, u_k) = x_k + h (u_k)_1
\begin{bmatrix} \cos (x_k)_3 \\ \sin (x_k)_3 \\ (\tan (u_k)_2)/L \end{bmatrix}.
$$

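
The state update above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `car_step` and `simulate` are hypothetical, NumPy is assumed, and the default values of `h` and `L` are placeholders rather than anything prescribed by the text.

```python
import numpy as np

def car_step(x, u, h=0.1, L=0.1):
    """One step of the discretized car model: returns f(x, u).

    x = (p1, p2, theta) is the state, u = (s, phi) is the input.
    """
    p1, p2, theta = x
    s, phi = u
    return np.array([
        p1 + h * s * np.cos(theta),
        p2 + h * s * np.sin(theta),
        theta + h * (s / L) * np.tan(phi),
    ])

def simulate(x1, inputs, h=0.1, L=0.1):
    """Apply car_step repeatedly; returns the states x_1, ..., x_{N+1} as rows."""
    states = [np.asarray(x1, dtype=float)]
    for u in inputs:
        states.append(car_step(states[-1], u, h, L))
    return np.array(states)

# Drive straight ahead (phi = 0) from the origin at unit speed for 10 steps.
traj = simulate([0.0, 0.0, 0.0], [(1.0, 0.0)] * 10)
print(traj[-1])  # final state (p1, p2, theta)
```

With zero steering angle the orientation never changes, so the car simply advances along the first coordinate by `h * s` per step.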
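We now consider the nonlinear optimal control problem

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{k=1}^{N} \|u_k\|^2 + \gamma \sum_{k=1}^{N-1} \|u_{k+1} - u_k\|^2 \\
\mbox{subject to} & x_2 = f(0, u_1) \\
& x_{k+1} = f(x_k, u_k), \quad k = 2, \ldots, N-1 \\
& x_{\mathrm{final}} = f(x_N, u_N),
\end{array}
\qquad (19.12)
$$

with variables $u_1, \ldots, u_N$, and $x_2, \ldots, x_N$.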
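
To make the least squares structure of (19.12) concrete, the sketch below evaluates a stacked objective residual (whose squared norm is the objective) and the vector of constraint violations. The function names, the tiny horizon, and the random inputs are assumptions made for illustration; NumPy is assumed.

```python
import numpy as np

def car_f(x, u, h, L):
    """Discretized car dynamics f(x, u) with x = (p1, p2, theta), u = (s, phi)."""
    return x + h * u[0] * np.array([np.cos(x[2]), np.sin(x[2]), np.tan(u[1]) / L])

def objective_residual(U, gamma):
    """Stack the u_k and sqrt(gamma) * (u_{k+1} - u_k); its squared norm is the objective."""
    diffs = U[1:] - U[:-1]
    return np.concatenate([U.ravel(), np.sqrt(gamma) * diffs.ravel()])

def constraint_residual(U, X, x_final, h, L):
    """Dynamics mismatch for (19.12). U holds u_1..u_N, X holds x_2..x_N (one per row)."""
    N = U.shape[0]
    current = np.vstack([np.zeros(3), X])    # x_1 = 0, x_2, ..., x_N
    target = np.vstack([X, x_final])         # x_2, ..., x_N, x_final
    predicted = np.array([car_f(current[k], U[k], h, L) for k in range(N)])
    return (target - predicted).ravel()

# Small illustrative instance (N = 5); the inputs are arbitrary test values.
h, L, gamma, N = 0.1, 0.1, 10.0, 5
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((N, 2))

# Roll the dynamics forward so that the states satisfy the constraints exactly.
xs = [np.zeros(3)]
for k in range(N):
    xs.append(car_f(xs[-1], U[k], h, L))
X, x_final = np.array(xs[1:N]), xs[N]

objective = np.sum(objective_residual(U, gamma) ** 2)
violation = np.linalg.norm(constraint_residual(U, X, x_final, h, L))
```

Here the states are generated from the inputs, so the constraint residual vanishes; in the optimization problem the states are free variables and the final state is prescribed, so the residual is generally nonzero until the algorithm converges.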

[Figure 19.7 panels include $x_{\mathrm{final}} = (0, 1, 0)$ and $x_{\mathrm{final}} = (0, 1, \pi/2)$.]

Figure 19.7 Solution trajectories of (19.12) for different end states $x_{\mathrm{final}}$. The
outline of the car shows the position $(p_1(kh), p_2(kh))$, orientation $\theta(kh)$, and
the steering angle $\phi(kh)$ at time $kh$.


Figure 19.7 shows solutions for

$$
L = 0.1, \quad N = 50, \quad h = 0.1, \quad \gamma = 10,
$$

and different values of $x_{\mathrm{final}}$. They are computed using the augmented Lagrangian
algorithm. The algorithm is started at the same starting point for each example.
The starting point for the input variables $u_k$ is randomly chosen; the starting point
for the states $x_k$ is zero.

Figure 19.8 The two inputs (speed and steering angle) for the trajectories
in figure 19.7.

Figure 19.9 Feasibility and optimality condition residuals in the augmented
Lagrangian algorithm for computing the trajectories in figure 19.7.

Exercises


19.1 Projection on a curve. We consider a constrained nonlinear least squares problem with
three variables $x = (x_1, x_2, x_3)$ and two equations:


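$$
\begin{array}{ll}
\mbox{minimize} & (x_1 - 1)^2 + (x_2 - 1)^2 + (x_3 - 1)^2 \\
\mbox{subject to} & x_1^2 + 0.5 x_2^2 + x_3^2 - 1 = 0 \\
& 0.8 x_1^2 + 2.5 x_2^2 + x_3^2 + 2 x_1 x_3 - x_1 - x_2 - x_3 - 1 = 0.
\end{array}
$$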


The solution is the point closest to (1,1,1) on the nonlinear curve defined by the two 
equations. 


(a) Solve the problem using the augmented Lagrangian method. You can start the
algorithm at $x^{(1)} = 0$, $z^{(1)} = 0$, $\mu^{(1)} = 1$, and start each run of the
Levenberg–Marquardt method with $\lambda^{(1)} = 1$. Stop the augmented Lagrangian method when
the feasibility residual $\|g(x^{(k)})\|$ and the optimality condition residual

$$
\left\| 2 Df(x^{(k)})^T f(x^{(k)}) + Dg(x^{(k)})^T z^{(k)} \right\|
$$

are less than $10^{-5}$. Make a plot of the two residuals and of the penalty parameter
$\mu$ versus the cumulative number of Levenberg–Marquardt iterations.


(b) Solve the problem using the penalty method, started at $x^{(1)} = 0$ and $\mu^{(1)} = 1$,
and with the same stopping condition. Compare the convergence and the value of
the penalty parameter with the results for the augmented Lagrangian method in
part (a).

19.2 Portfolio optimization with downside risk. In standard portfolio optimization (as described
in §17.1) we choose the weight vector $w$ to achieve a given target mean return,
and to minimize deviations from the target return value (i.e., the risk). This leads to the
constrained linear least squares problem (17.2). One criticism of this formulation is that
it treats portfolio returns that exceed our target value the same as returns that fall short
of our target value, whereas in fact we should be delighted to have a return that exceeds
our target value. To address this deficiency in the formulation, researchers have defined
the downside risk of a portfolio return time series $T$-vector $r$, which is sensitive only to
portfolio returns that fall short of our target value $\rho^{\mathrm{tar}}$. The downside risk of a portfolio
return time series ($T$-vector) $r$ is given by

$$
D = \frac{1}{T} \sum_{t=1}^{T} \left( \max\{\rho^{\mathrm{tar}} - r_t, 0\} \right)^2.
$$

The quantity $\max\{\rho^{\mathrm{tar}} - r_t, 0\}$ is the shortfall, i.e., the amount by which the return in
period $t$ falls short of the target; it is zero when the return exceeds the target value. The
downside risk is the mean square value of the return shortfall.
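
The definition of downside risk translates directly into code. The sketch below only evaluates $D$ for a given return series; the function name `downside_risk` and the sample returns are made up for illustration, and NumPy is assumed.

```python
import numpy as np

def downside_risk(r, rho_tar):
    """Mean square shortfall of the return series r relative to the target rho_tar."""
    shortfall = np.maximum(rho_tar - np.asarray(r, dtype=float), 0.0)
    return np.mean(shortfall ** 2)

# Illustrative returns; only the second and fourth periods fall short of the target.
r = np.array([0.02, -0.01, 0.03, 0.00])
D = downside_risk(r, rho_tar=0.01)
print(D)
```

Periods in which the return exceeds the target contribute zero, which is exactly the asymmetry that distinguishes downside risk from the usual risk.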


(a) Formulate the portfolio optimization problem, using downside risk in place of the 
usual risk, as a constrained nonlinear least squares problem. Be sure to explain what 
the functions f and g are. 


(b) Since the function $g$ is affine (if you formulate the problem correctly), you can use the
Levenberg–Marquardt algorithm, modified to handle linear equality constraints, to
approximately solve the problem. (See page 420.) Find an expression for $Df(x)$.
You can ignore the fact that the function $f$ is not differentiable at some points.


(c) Implement the Levenberg–Marquardt algorithm to find weights that minimize downside
risk for a given target annualized return. A very reasonable starting point is the
solution of the standard portfolio optimization problem with the same target return.
Check your implementation with some simulated or real return data (available online).
Compare the weights, and the risk and downside risk, for the minimum risk
and the minimum downside risk portfolios.

19.3 Boolean least squares. The Boolean least squares problem is a special case of the constrained
nonlinear least squares problem (19.1), with the form

$$
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|^2 \\
\mbox{subject to} & x_i^2 = 1, \quad i = 1, \ldots, n,
\end{array}
$$

where the $n$-vector $x$ is the variable to be chosen, and the $m \times n$ matrix $A$ and the
$m$-vector $b$ are the (given) problem data. The constraints require that each entry of $x$ is
either $-1$ or $+1$, i.e., $x$ is a Boolean vector. Since each entry can take one of two values,
there are $2^n$ feasible values for the vector $x$. The Boolean least squares problem arises in
many applications.


One simple method for solving the Boolean least squares problem, sometimes called the
brute force method, is to evaluate the objective function $\|Ax - b\|^2$ for each of the $2^n$
possible values, and choose one that has the least value. This method is not practical for
$n$ larger than 30 or so. There are many heuristic methods that are much faster to carry
out than the brute force method, and approximately solve it, i.e., find an $x$ for which the
objective is small, if not the smallest possible value over all $2^n$ feasible values of $x$. One
such heuristic is the augmented Lagrangian algorithm 19.2.
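
For reference, the brute force method described above can be sketched as follows. The function name and the randomly generated test data are assumptions for illustration (NumPy is assumed); the loop visits all $2^n$ sign patterns, so it is only usable for small $n$.

```python
import itertools
import numpy as np

def boolean_ls_brute_force(A, b):
    """Evaluate ||Ax - b||^2 for every x in {-1, +1}^n and return the best one."""
    n = A.shape[1]
    best_x, best_val = None, np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(signs)
        val = np.sum((A @ x - b) ** 2)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative instance: b is generated from a known Boolean vector.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
x_true = rng.choice([-1.0, 1.0], size=8)
b = A @ x_true
x_bf, val_bf = boolean_ls_brute_force(A, b)
```

A routine like this is mainly useful as a ground truth against which a heuristic can be compared on small problems.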


(a) Work out the details of the update step in the Levenberg–Marquardt algorithm
used in each iteration of the augmented Lagrangian algorithm, for the Boolean least
squares problem.


(b) Implement the augmented Lagrangian algorithm for the Boolean least squares problem.
You can choose the starting point $x^{(1)}$ as the minimizer of $\|Ax - b\|^2$. At
each iteration, you can obtain a feasible point $\tilde x^{(k)}$ by rounding the entries of $x^{(k)}$
to the values $\pm 1$, i.e., $\tilde x^{(k)} = \mathbf{sign}(x^{(k)})$. You should evaluate and plot the objective
value of these feasible points, i.e., $\|A \tilde x^{(k)} - b\|^2$. Your implementation can return
the best rounded value found during the iterations. Try your method on some small
problems, with $n = m = 10$ (say), for which you can find the actual solution by the
brute force method. Try it on much larger problems, with $n = m = 500$ (say), for
which the brute force method is not practical.

Notation


Vectors

$\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$    $n$-vector with entries $x_1, \ldots, x_n$.
$(x_1, \ldots, x_n)$    $n$-vector with entries $x_1, \ldots, x_n$.
$x_i$    The $i$th entry of a vector $x$.
$x_{r:s}$    Subvector with entries from $r$ to $s$.
$0$    Vector with all entries zero.
$\mathbf{1}$    Vector with all entries one.
$e_i$    The $i$th standard unit vector.
$x^T y$    Inner product of vectors $x$ and $y$.
$\|x\|$    Norm of vector $x$.
$\mathbf{rms}(x)$    Root-mean-square value of a vector $x$.
$\mathbf{avg}(x)$    Average of entries of a vector $x$.
$\mathbf{std}(x)$    Standard deviation of a vector $x$.
$\mathbf{dist}(x, y)$    Distance between vectors $x$ and $y$.
$\angle(x, y)$    Angle between vectors $x$ and $y$.
$x \perp y$    Vectors $x$ and $y$ are orthogonal.


Matrices

$\begin{bmatrix} X_{11} & \cdots & X_{1n} \\ \vdots & & \vdots \\ X_{m1} & \cdots & X_{mn} \end{bmatrix}$    $m \times n$ matrix with entries $X_{11}, \ldots, X_{mn}$.
$X_{ij}$    The $i,j$th entry of a matrix $X$.
$X_{r:s,p:q}$    Submatrix with rows $r, \ldots, s$ and columns $p, \ldots, q$.
$0$    Matrix with all entries zero.
$I$    Identity matrix.
$X^T$    Transpose of matrix $X$.
$\|X\|$    Norm of matrix $X$.
$X^k$    (Square) matrix $X$ to the $k$th power.
$X^{-1}$    Inverse of (square) matrix $X$.
$X^{-T}$    Inverse of transpose of matrix $X$.
$X^\dagger$    Pseudo-inverse of matrix $X$.
$\mathbf{diag}(x)$    Diagonal matrix with diagonal entries $x_1, \ldots, x_n$.


Functions and derivatives

$f : A \to B$    $f$ is a function on the set $A$ into the set $B$.
$\nabla f(z)$    Gradient of function $f : \mathbf{R}^n \to \mathbf{R}$ at $z$.
$Df(z)$    Derivative (Jacobian) matrix of function $f : \mathbf{R}^n \to \mathbf{R}^m$ at $z$.


Ellipsis notation 


In this book we use standard mathematical ellipsis notation in lists and sums. We
write $k, \ldots, l$ to mean the list of all integers from $k$ to $l$. For example, $3, \ldots, 7$
means $3, 4, 5, 6, 7$. This notation is used to describe a list of numbers or vectors,
or in sums, as in $\sum_{i=1}^{n} a_i$, which we also write as $a_1 + \cdots + a_n$. Both of these
mean the sum of the $n$ terms $a_1, a_2, \ldots, a_n$.


Sets 


In a few places in this book we encounter the mathematical concept of sets. The
notation $\{a_1, \ldots, a_n\}$ refers to a set with elements $a_1, \ldots, a_n$. This is not the same
as the vector with entries $a_1, \ldots, a_n$, which is denoted $(a_1, \ldots, a_n)$. For sets the
order does not matter, so, for example, we have $\{1, 2, 6\} = \{6, 1, 2\}$. Unlike a
vector, a set cannot have repeated elements. We can also specify a set by giving
conditions that its entries must satisfy, using the notation $\{x \mid \mathrm{condition}(x)\}$, which
means the set of $x$ that satisfy the condition, which depends on $x$. We say that a set
contains its elements, or that the elements are in the set, using the symbol $\in$, as in
$2 \in \{1, 2, 6\}$. The symbol $\notin$ means not in, or not an element of, as in $3 \notin \{1, 2, 6\}$.
We can use sets to describe a sum over some elements in a list. The notation
$\sum_{i \in S} x_i$ means the sum over all $x_i$ for which $i$ is in the set $S$. As an example,
$\sum_{i \in \{1,2,6\}} a_i$ means $a_1 + a_2 + a_6$.
A few sets have specific names: $\mathbf{R}$ is the set of real numbers (or scalars), and
$\mathbf{R}^n$ is the set of all $n$-vectors. So $\alpha \in \mathbf{R}$ means that $\alpha$ is a number, and $x \in \mathbf{R}^n$
means that $x$ is an $n$-vector.

Complexity


Here we summarize approximate complexities or flop counts of various operations 
and algorithms encountered in the book. We drop terms of lower order. When 
the operands or arguments are sparse, and the operation or algorithm has been 
modified to take advantage of sparsity, the flop counts can be dramatically lower 
than those given here. 


Vector operations 


In the table below, $x$ and $y$ are $n$-vectors and $a$ is a scalar.

Operation            Flops
$ax$                 $n$
$x + y$              $n$
$x^T y$              $2n$
$\|x\|$              $2n$
$\|x - y\|$          $3n$
$\mathbf{rms}(x)$    $2n$
$\mathbf{std}(x)$    $4n$
$\angle(x, y)$       $6n$

The convolution $a * b$ of an $n$-vector $a$ and $m$-vector $b$ can be computed by a special
algorithm that requires $5(m + n) \log_2 (m + n)$ flops.

Matrix operations


In the table below, $A$ and $B$ are $m \times n$ matrices, $C$ is an $n \times p$ matrix, $x$ is an
$n$-vector, and $a$ is a scalar.

Operation    Flops
$aA$         $mn$
$A + B$      $mn$
$Ax$         $2mn$
$AC$         $2mnp$
$A^T A$      $mn^2$
$\|A\|$      $2mn$

Factorization and inverses


In the table below, $A$ is a tall or square $m \times n$ matrix, $R$ is an $n \times n$ triangular
matrix, and $b$ is an $n$-vector. We assume the factorization or inverses exist; in
particular in any expression involving $A^{-1}$, $A$ must be square.

Operation                  Flops
QR factorization of $A$    $2mn^2$
$R^{-1} b$                 $n^2$
$A^{-1} b$                 $2n^3$
$A^{-1}$                   $3n^3$
$A^\dagger$                $3mn^2$

The pseudo-inverse $A^\dagger$ of a wide $m \times n$ matrix (with linearly independent rows)
can be computed in $3m^2 n$ flops.

Solving least squares problems


In the table below, $A$ is an $m \times n$ matrix, $C$ is a wide $p \times n$ matrix, and $b$ is an
$m$-vector. We assume the associated independence conditions hold.

Problem                                         Flops
minimize $\|Ax - b\|^2$                         $2mn^2$
minimize $\|Ax - b\|^2$ subject to $Cx = d$     $2(m + p)n^2 + 2np^2$
minimize $\|x\|^2$ subject to $Cx = d$          $2np^2$


Big-times-small-squared mnemonic 


Many of the complexities listed above that involve two dimensions can be remem- 
bered using a simple mnemonic: The cost is order 


$(\mbox{big}) \times (\mbox{small})^2$ flops,


where ‘big’ and ‘small’ refer to big and small problem dimensions. We list some 
examples below. 


• Computing the Gram matrix of a tall $m \times n$ matrix requires $mn^2$ flops. Here
$m$ is the big dimension and $n$ is the small dimension.

• In the QR factorization of an $m \times n$ matrix, we have $m \geq n$, so $m$ is the big
dimension and $n$ is the small dimension. The complexity is $2mn^2$ flops.

• Computing the pseudo-inverse $A^\dagger$ of an $m \times n$ matrix $A$ when $A$ is tall
(and has independent columns) costs $3mn^2$ flops. When $A$ is wide (and has
independent rows), it is $3nm^2$ flops.

• For least squares, we have $m \geq n$, so $m$ is the big dimension and $n$ is the small
dimension. The cost of computing the least squares approximate solution is
$2mn^2$ flops.

• For the least norm problem, we have $p \leq n$, so $n$ is the big dimension and $p$
is the small dimension. The cost is $2np^2$ flops.

• The constrained least squares problem involves two matrices $A$ and $C$, and
three dimensions that satisfy $m + p \geq n$. The numbers $m + p$ and $n$ are the
big and small dimensions of the stacked matrix
$\begin{bmatrix} A \\ C \end{bmatrix}$. The cost of solving
the constrained least squares problem is $2(m + p)n^2 + 2np^2$ flops, which is
between $2(m + p)n^2$ flops and $4(m + p)n^2$ flops, since $n \leq m + p$.
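
The flop counts in the list above are easy to tabulate. The helper functions below are a hypothetical sketch (the names are not from the text) that simply encodes the leading-order formulas and checks the stated bounds for the constrained problem on one illustrative set of dimensions.

```python
def flops_gram(m, n):
    """Gram matrix of a tall m x n matrix."""
    return m * n**2

def flops_qr(m, n):
    """QR factorization (also the cost of least squares), m >= n."""
    return 2 * m * n**2

def flops_pinv(m, n):
    """Pseudo-inverse: 3 * big * small^2, whether the matrix is tall or wide."""
    big, small = max(m, n), min(m, n)
    return 3 * big * small**2

def flops_least_norm(p, n):
    """Least norm problem with a wide p x n matrix C."""
    return 2 * n * p**2

def flops_constrained_ls(m, n, p):
    """Constrained least squares with A (m x n) and C (p x n)."""
    return 2 * (m + p) * n**2 + 2 * n * p**2

# Illustrative dimensions satisfying p <= n <= m + p.
m, n, p = 1000, 100, 20
cost = flops_constrained_ls(m, n, p)
lower, upper = 2 * (m + p) * n**2, 4 * (m + p) * n**2
print(lower, cost, upper)
```

The printed values show the constrained least squares cost sitting between the two bounds quoted in the last item of the list.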


Appendix C


Derivatives and optimization


Calculus does not play a big role in this book, except in chapters 18 and 19 (on
nonlinear least squares and constrained nonlinear least squares), where we use
derivatives, Taylor approximations, and the method of Lagrange multipliers. In
this appendix we collect some basic material about derivatives and optimization,
focusing on the few results and formulas we use.


C.1 Derivatives


C.1.1 Scalar-valued function of a scalar


Definition. Suppose $f : \mathbf{R} \to \mathbf{R}$ is a real-valued function of a real (scalar) variable.
For any number $x$, the number $f(x)$ is the value of the function, and $x$ is called
the argument of the function. The number

$$
\lim_{t \to 0} \frac{f(z + t) - f(z)}{t}
$$

(if the limit exists) is called the derivative of the function $f$ at the point $z$. It gives
the slope of the graph of $f$ at the point $(z, f(z))$. We denote the derivative of $f$ at
$z$ as $f'(z)$. We can think of $f'$ as a scalar-valued function of a scalar variable; this
function is called the derivative (function) of $f$.


Taylor approximation. Let us fix the number $z$. The (first-order) Taylor approximation
of the function $f$ at the point $z$ is defined as

$$
\hat f(x) = f(z) + f'(z)(x - z)
$$

for any $x$. Here $f(z)$ is the value of $f$ at $z$, $x - z$ is the deviation of $x$ from $z$,
and $f'(z)(x - z)$ is the approximation of the change in value of the function due
to the deviation of $x$ from $z$. Sometimes the Taylor approximation is shown with
a second argument, separated by a semicolon, to denote the point $z$ where the
approximation is made. Using this notation, the left-hand side of the equation above
is written $\hat f(x; z)$. The Taylor approximation is sometimes called the linearized
approximation of $f$ at $z$. (Here linear uses informal mathematical language, where
affine is sometimes called linear.) The Taylor approximation function $\hat f$ is an affine
function of $x$, i.e., a linear function of $x$ plus a constant.

The Taylor approximation $\hat f$ satisfies $\hat f(z; z) = f(z)$, i.e., at the point $z$ it agrees
with the function $f$. For $x$ near $z$, $\hat f(x; z)$ is a very good approximation of $f(x)$.
For $x$ not close enough to $z$, however, the approximation can be poor.
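
A small numerical experiment illustrates how the quality of the Taylor approximation depends on the distance from $z$. The helper `taylor_approx` and the choice $f(x) = e^x$, $z = 0$ are assumptions made for this sketch.

```python
import math

def taylor_approx(f, fprime, z):
    """Return the first-order Taylor approximation of f at z as a function of x."""
    return lambda x: f(z) + fprime(z) * (x - z)

# f(x) = exp(x) at z = 0, so the approximation is 1 + x.
f_hat = taylor_approx(math.exp, math.exp, 0.0)

errors = {x: abs(math.exp(x) - f_hat(x)) for x in (0.5, 0.1, 0.01)}
for x, err in errors.items():
    print(x, err)
```

The error shrinks much faster than the deviation $x - z$ itself, which is what "very good approximation for $x$ near $z$" means in practice.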


Finding derivatives. In a basic calculus course, the derivatives of many common
functions are worked out. For example, with $f(x) = x^2$, we have $f'(z) = 2z$, and
for $f(x) = e^x$, we have $f'(z) = e^z$. Derivatives of more complex functions can be
found using these known derivatives of common functions, along with a few rules
for finding the derivative of various combinations of functions. For example, the
chain rule gives the derivative of a composition of two functions. If $f(x) = g(h(x))$,
where $g$ and $h$ are scalar-valued functions of a scalar variable, we have

$$
f'(z) = g'(h(z)) h'(z).
$$

Another useful rule is the product rule, for $f(x) = g(x)h(x)$, which is

$$
f'(z) = g'(z)h(z) + g(z)h'(z).
$$

The derivative operation is linear, which means that if $f(x) = a g(x) + b h(x)$, where
$a$ and $b$ are constants, we have

$$
f'(z) = a g'(z) + b h'(z).
$$

Knowledge of the derivative of just a few common functions, and a few combination
rules like the ones above, is enough to determine the derivatives of many functions.
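
As a short worked illustration of the chain rule (the function is chosen here only as an example), consider $f(x) = e^{x^2}$, written as a composition of the two common functions mentioned above:

```latex
\begin{aligned}
f(x) &= g(h(x)), \qquad g(y) = e^{y}, \qquad h(x) = x^2, \\
g'(y) &= e^{y}, \qquad h'(z) = 2z, \\
f'(z) &= g'(h(z))\, h'(z) = 2z\, e^{z^2}.
\end{aligned}
```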


Scalar-valued function of a vector 


Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is a scalar-valued function of an $n$-vector argument. The
number $f(x)$ is the value of the function $f$ at the $n$-vector (argument) $x$. We
sometimes write out the argument of $f$ to show that it can be considered a function
of $n$ scalar arguments, $x_1, \ldots, x_n$:

$$
f(x) = f(x_1, \ldots, x_n).
$$

Partial derivative. The partial derivative of $f$ at the point $z$, with respect to its
$i$th argument, is defined as

$$
\begin{aligned}
\frac{\partial f}{\partial x_i}(z)
&= \lim_{t \to 0} \frac{f(z_1, \ldots, z_{i-1}, z_i + t, z_{i+1}, \ldots, z_n) - f(z)}{t} \\
&= \lim_{t \to 0} \frac{f(z + t e_i) - f(z)}{t}
\end{aligned}
$$

(if the limit exists). Roughly speaking, the partial derivative is the derivative with
respect to the $i$th argument, with all other arguments fixed.

Gradient. The partial derivatives of $f$ with respect to its $n$ arguments can be
collected into an $n$-vector called the gradient of $f$ (at $z$):

$$
\nabla f(z) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}(z) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(z) \end{bmatrix}.
$$

Taylor approximation. The (first-order) Taylor approximation of $f$ at the point
$z$ is the function $\hat f : \mathbf{R}^n \to \mathbf{R}$ defined as

$$
\hat f(x) = f(z) + \frac{\partial f}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial f}{\partial x_n}(z)(x_n - z_n)
$$

for any $x$. We interpret $x_i - z_i$ as the deviation of $x_i$ from $z_i$, and the term
$\frac{\partial f}{\partial x_i}(z)(x_i - z_i)$ as an approximation of the change in $f$ due to the deviation of $x_i$
from $z_i$. Sometimes $\hat f$ is written with a second vector argument, as $\hat f(x; z)$, to show
the point $z$ at which the approximation is developed. The Taylor approximation
can be written in compact form as

$$
\hat f(x; z) = f(z) + \nabla f(z)^T (x - z).
$$

The Taylor approximation $\hat f$ is an affine function of $x$.

The Taylor approximation $\hat f$ agrees with the function $f$ at the point $z$, i.e.,
$\hat f(z; z) = f(z)$. When all $x_i$ are near the associated $z_i$, $\hat f(x; z)$ is a very good
approximation of $f(x)$. The Taylor approximation is sometimes called the linear
approximation or linearized approximation of $f$ (at $z$), even though in general it is
affine, and not linear.


Finding gradients. The gradient of a function can be found by evaluating the
partial derivatives using the common functions and rules for derivatives of scalar-valued
functions, and assembling the result into a vector. In many cases the result
can be expressed in a more compact matrix-vector form. As an example let us find
the gradient of the function

$$
f(x) = \|x\|^2 = x_1^2 + \cdots + x_n^2,
$$

which is the sum of squares of the arguments. The partial derivatives are

$$
\frac{\partial f}{\partial x_i}(z) = 2 z_i, \quad i = 1, \ldots, n.
$$

This leads to the very simple vector formula

$$
\nabla f(z) = 2z.
$$

(Note the resemblance to the formula for the derivative of the square of a scalar
variable.)

There are rules for the gradient of a combination of functions similar to those
for functions of a scalar. For example if $f(x) = a g(x) + b h(x)$, we have

$$
\nabla f(z) = a \nabla g(z) + b \nabla h(z).
$$
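
The formula $\nabla f(z) = 2z$ can be checked numerically by approximating each partial derivative with a difference quotient, following the definition of the partial derivative. The helper `numerical_gradient`, the step size, and the test point are assumptions of this sketch; NumPy is assumed.

```python
import numpy as np

def numerical_gradient(f, z, t=1e-6):
    """Approximate the gradient of f at z using (f(z + t e_i) - f(z)) / t."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        e_i = np.zeros_like(z)
        e_i[i] = 1.0                      # ith standard unit vector
        grad[i] = (f(z + t * e_i) - f(z)) / t
    return grad

def sum_of_squares(x):
    return np.sum(x ** 2)

z = np.array([1.0, -2.0, 0.5])
g_num = numerical_gradient(sum_of_squares, z)
g_formula = 2 * z
print(g_num, g_formula)
```

The two vectors agree up to an error on the order of the step size `t`, as expected for a one-sided difference quotient.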

Vector-valued function of a vector


Suppose $f : \mathbf{R}^n \to \mathbf{R}^m$ is a vector-valued function of a vector. The $n$-vector $x$ is
the argument; the $m$-vector $f(x)$ is the value of the function $f$ at $x$. We can write
out the $m$ components of $f$ as

$$
f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix},
$$

where $f_i$ is a scalar-valued function of $x = (x_1, \ldots, x_n)$.


Jacobian. The partial derivatives of the components of $f(x)$ with respect to the
components of $x$, evaluated at $z$, are arranged into an $m \times n$ matrix denoted $Df(z)$,
called the derivative matrix or Jacobian of $f$ at $z$. (In the notation $Df(z)$, the
$D$ and $f$ go together; $Df$ does not represent, say, a matrix-vector product.) The
derivative matrix is defined by

$$
Df(z)_{ij} = \frac{\partial f_i}{\partial x_j}(z), \quad i = 1, \ldots, m, \quad j = 1, \ldots, n.
$$

The rows of the Jacobian are $\nabla f_i(z)^T$, for $i = 1, \ldots, m$. For $m = 1$, i.e., when
$f$ is a scalar-valued function, the derivative matrix is a row vector of size $n$, the
transpose of the gradient of the function. The derivative matrix of a vector-valued
function of a vector is a generalization of the derivative of a scalar-valued function
of a scalar.


Taylor approximation. The (first-order) Taylor approximation of $f$ near $z$ is given
by

$$
\begin{aligned}
\hat f_i(x) &= f_i(z) + \frac{\partial f_i}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial f_i}{\partial x_n}(z)(x_n - z_n) \\
&= f_i(z) + \nabla f_i(z)^T (x - z),
\end{aligned}
$$

for $i = 1, \ldots, m$. We can express this approximation in compact notation as

$$
\hat f(x) = f(z) + Df(z)(x - z).
$$

For $x$ near $z$, $\hat f(x)$ is a very good approximation of $f(x)$. As in the scalar case, the
Taylor approximation is sometimes written with a second argument as $\hat f(x; z)$ to
show the point $z$ around which the approximation is made. The Taylor approximation
$\hat f$ is an affine function of $x$, sometimes called a linear approximation of $f$,
even though it is not, in general, a linear function.


Finding Jacobians. We can always find the derivative matrix by calculating partial
derivatives of the entries of $f$ with respect to the components of the argument
vector. In many cases the result simplifies using matrix-vector notation. As an
example, let us find the derivative of the (scalar-valued) function

$$
h(x) = \|f(x)\|^2 = f_1(x)^2 + \cdots + f_m(x)^2,
$$

where $f : \mathbf{R}^n \to \mathbf{R}^m$. The partial derivative with respect to $x_j$, at $z$, is

$$
\frac{\partial h}{\partial x_j}(z) = 2 f_1(z) \frac{\partial f_1}{\partial x_j}(z) + \cdots + 2 f_m(z) \frac{\partial f_m}{\partial x_j}(z).
$$

Arranging these to form the row vector $Dh(z)$, we see we can write this using
matrix multiplication as

$$
Dh(z) = 2 f(z)^T Df(z).
$$

The gradient of $h$ is the transpose of this expression,

$$
\nabla h(z) = 2 Df(z)^T f(z). \qquad \mbox{(C.1)}
$$



(Note the analogy to the formula for the scalar-valued function of a scalar variable
$h(x) = f(x)^2$, which is $h'(z) = 2 f'(z) f(z)$.)
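
Formula (C.1) can be sanity-checked numerically. In the sketch below the function `f`, its hand-coded Jacobian `Df`, and the test point are all made up for illustration (NumPy is assumed); the gradient from (C.1) is compared with central difference quotients of $h$.

```python
import numpy as np

def f(x):
    """An illustrative function from R^2 to R^3."""
    return np.array([x[0]**2 + x[1], np.sin(x[0]) * x[1], x[0] - 3.0 * x[1]])

def Df(x):
    """Jacobian of f, worked out by hand; rows are the gradients of f_1, f_2, f_3."""
    return np.array([
        [2 * x[0], 1.0],
        [np.cos(x[0]) * x[1], np.sin(x[0])],
        [1.0, -3.0],
    ])

def h(x):
    return np.sum(f(x) ** 2)

z = np.array([0.7, -1.2])
grad_formula = 2 * Df(z).T @ f(z)          # expression (C.1)

# Central differences along each standard unit vector.
t = 1e-6
grad_numeric = np.array([(h(z + t * e) - h(z - t * e)) / (2 * t) for e in np.eye(2)])
print(grad_formula, grad_numeric)
```

A check of this kind is a common way to catch mistakes in hand-derived Jacobians before using them in an algorithm such as Levenberg–Marquardt.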

Many of the formulas for derivatives in the scalar case also hold for the vector
case, with scalar multiplication replaced with matrix multiplication (provided the
order of the terms is correct). As an example, consider the composition function
$f(x) = g(h(x))$, where $h : \mathbf{R}^n \to \mathbf{R}^k$ and $g : \mathbf{R}^k \to \mathbf{R}^m$. The Jacobian or derivative
matrix of $f$ at $z$ is given by

$$
Df(z) = Dg(h(z)) Dh(z).
$$

(This is matrix multiplication; compare it to the composition formula for scalar-valued
functions of scalars given above.) This chain rule is described on page 184.


C.2 Optimization


Derivative condition for minimization. Suppose $h$ is a scalar-valued function of
a scalar argument. If $\hat x$ minimizes $h(x)$, we must have $h'(\hat x) = 0$. This fact is easily
understood: If $h'(\hat x) \neq 0$, then by taking a point $\tilde x$ slightly less than $\hat x$ (if $h'(\hat x) > 0$)
or slightly more than $\hat x$ (if $h'(\hat x) < 0$), we would obtain $h(\tilde x) < h(\hat x)$, which shows
that $\hat x$ does not minimize $h(x)$. This leads to the classic calculus-based method
for finding a minimizer of a function $f$: Find the derivative, and set it equal to
zero. One subtlety here is that there can be (and generally are) points that satisfy
$h'(z) = 0$, but are not minimizers of $h$. So we generally need to check which of the
solutions of $h'(z) = 0$ are in fact minimizers of $h$.
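
A simple worked example (chosen here for illustration) shows why the check is needed. Take $h(x) = x^4 - 2x^2$:

```latex
\begin{aligned}
h(x) &= x^4 - 2x^2 = (x^2 - 1)^2 - 1, \\
h'(x) &= 4x^3 - 4x = 4x(x - 1)(x + 1), \\
h'(z) = 0 \;&\Longleftrightarrow\; z \in \{-1, 0, 1\}, \\
h(-1) &= h(1) = -1, \qquad h(0) = 0.
\end{aligned}
```

All three points satisfy the derivative condition, but only $z = -1$ and $z = 1$ minimize $h$ (since $h(x) \geq -1$ for all $x$); the point $z = 0$ does not.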


Gradient condition for minimization. This basic calculus-based method for finding
a minimizer of a scalar-valued function can be generalized to functions with
vector arguments. If the $n$-vector $\hat x$ minimizes $h : \mathbf{R}^n \to \mathbf{R}$, then we must have

$$
\frac{\partial h}{\partial x_i}(\hat x) = 0, \quad i = 1, \ldots, n.
$$

In vector notation, we must have

$$
\nabla h(\hat x) = 0.
$$

Like the case of a scalar argument, this is easily seen to hold if $\hat x$ minimizes $h$. Also
as in the case of a scalar argument, there can be points that satisfy $\nabla h(z) = 0$ but
are not minimizers of $h$. So we need to check if points found this way are in fact
minimizers of $h$.


Nonlinear least squares. As an example, consider the nonlinear least squares
problem, with objective $h(x) = \|f(x)\|^2$, where $f : \mathbf{R}^n \to \mathbf{R}^m$. The optimality
condition $\nabla h(\hat x) = 0$ is

$$
2 Df(\hat x)^T f(\hat x) = 0
$$

(using the expression (C.1) for the gradient, derived above). This equation will
hold for a minimizer, but there can be points that satisfy the equation, but are not
solutions of the nonlinear least squares problem.


C.3 Lagrange multipliers


Constrained optimization. We now consider the problem of minimizing a scalar-valued
function $h : \mathbf{R}^n \to \mathbf{R}$, subject to the requirements, or constraints, that

$$
g_1(x) = 0, \quad \ldots, \quad g_p(x) = 0
$$

must hold, where $g_i : \mathbf{R}^n \to \mathbf{R}$ are given functions. We can write the constraints
in compact vector form $g(x) = 0$, where $g(x) = (g_1(x), \ldots, g_p(x))$, and express the
problem as

$$
\begin{array}{ll}
\mbox{minimize} & h(x) \\
\mbox{subject to} & g(x) = 0.
\end{array}
$$

We seek a solution of this optimization problem, i.e., a point $\hat x$ that satisfies $g(\hat x) = 0$
(i.e., is feasible) and, for any other $x$ that satisfies $g(x) = 0$, we have $h(x) \geq h(\hat x)$.

The method of Lagrange multipliers is an extension of the derivative or gradient
conditions for (unconstrained) minimization, that handles constrained optimization
problems.


Lagrange multipliers. The Lagrangian function associated with the constrained
problem is defined as

$$
\begin{aligned}
L(x, z) &= h(x) + z_1 g_1(x) + \cdots + z_p g_p(x) \\
&= h(x) + g(x)^T z,
\end{aligned}
$$

with arguments $x$ (the original variable to be determined in the optimization problem),
and a $p$-vector $z$, called the (vector of) Lagrange multipliers. The Lagrangian
function is the original objective, with one term added for each constraint function.
Each term is the constraint function value multiplied by $z_i$, hence the name
multiplier.


KKT conditions. The KKT conditions (named for Karush, Kuhn, and Tucker)
state that if $\hat x$ is a solution of the constrained optimization problem, then there is
a vector $\hat z$ that satisfies

$$
\frac{\partial L}{\partial x_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, n, \qquad
\frac{\partial L}{\partial z_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, p.
$$

(This is provided the rows of $Dg(\hat x)$ are linearly independent, a technical condition
we ignore.) As in the unconstrained case, there can be pairs $\hat x$, $\hat z$ that satisfy the
KKT conditions but $\hat x$ is not a solution of the constrained optimization problem.

The KKT conditions give us a method for solving the constrained optimization
problem that is similar to the approach for the unconstrained optimization problem.
We attempt to solve the KKT equations for $\hat x$ and $\hat z$; then we check to see if any
of the points found are really solutions.

We can simplify the KKT conditions, and express them compactly using matrix
notation. The last $p$ equations can be expressed as $g_i(\hat x) = 0$, which we already
knew. The first $n$ can be expressed as

$$
\nabla_x L(\hat x, \hat z) = 0,
$$

where $\nabla_x$ denotes the gradient with respect to the $x_i$ arguments. This can be
written as

$$
\nabla h(\hat x) + \hat z_1 \nabla g_1(\hat x) + \cdots + \hat z_p \nabla g_p(\hat x) = \nabla h(\hat x) + Dg(\hat x)^T \hat z = 0.
$$

So the KKT conditions for the constrained optimization problem are

$$
\nabla h(\hat x) + Dg(\hat x)^T \hat z = 0, \qquad g(\hat x) = 0.
$$

This is the extension of the gradient condition for unconstrained optimization to
the constrained case.
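
A tiny example, made up for illustration, shows the KKT conditions at work. For the problem of minimizing $h(x) = x_1^2 + x_2^2$ subject to $g(x) = x_1 + x_2 - 1 = 0$, we have $\nabla h(x) = 2x$ and $Dg(x) = [\,1 \;\; 1\,]$, so the KKT conditions are a set of three linear equations in $(\hat x_1, \hat x_2, \hat z)$, which the sketch below solves with NumPy.

```python
import numpy as np

# KKT equations for: minimize x1^2 + x2^2 subject to x1 + x2 - 1 = 0.
#   grad h(x) + Dg(x)^T z = 0   ->   2*x1 + z = 0,   2*x2 + z = 0
#   g(x) = 0                    ->   x1 + x2 = 1
K = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, z = np.linalg.solve(K, rhs)
print(x1, x2, z)
```

The solution is the point on the line $x_1 + x_2 = 1$ closest to the origin, together with its Lagrange multiplier. In general the KKT equations are nonlinear and cannot be solved with a single linear solve.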


Constrained nonlinear least squares. As an example, consider the constrained
nonlinear least squares problem

$$
\begin{array}{ll}
\mbox{minimize} & \|f(x)\|^2 \\
\mbox{subject to} & g(x) = 0,
\end{array}
$$

where $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^p$. Define $h(x) = \|f(x)\|^2$. Its gradient at $\hat x$
is $2 Df(\hat x)^T f(\hat x)$ (see above) so the KKT conditions are

$$
2 Df(\hat x)^T f(\hat x) + Dg(\hat x)^T \hat z = 0, \qquad g(\hat x) = 0.
$$

These conditions will hold for a solution of the problem (assuming the rows of
$Dg(\hat x)$ are linearly independent). But there can be points that satisfy them and
are not solutions.
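
The two KKT conditions suggest two residuals that can be monitored numerically: a feasibility residual $\|g(x)\|$ and an optimality condition residual $\|2Df(x)^T f(x) + Dg(x)^T z\|$. The sketch below evaluates them for a small made-up problem (the point on the unit circle closest to $(1, 1)$); the function names and the example are assumptions, and NumPy is assumed.

```python
import numpy as np

def kkt_residuals(f, Df, g, Dg, x, z):
    """Return (feasibility residual, optimality condition residual) at (x, z)."""
    feasibility = np.linalg.norm(g(x))
    optimality = np.linalg.norm(2 * Df(x).T @ f(x) + Dg(x).T @ z)
    return feasibility, optimality

# Illustrative problem: minimize ||x - (1, 1)||^2 subject to x1^2 + x2^2 - 1 = 0.
f = lambda x: x - np.array([1.0, 1.0])
Df = lambda x: np.eye(2)
g = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])
Dg = lambda x: np.array([[2 * x[0], 2 * x[1]]])

# Candidate solution and multiplier, worked out by hand from the KKT equations.
x_hat = np.array([1.0, 1.0]) / np.sqrt(2.0)
z_hat = np.array([np.sqrt(2.0) - 1.0])

feas, opt = kkt_residuals(f, Df, g, Dg, x_hat, z_hat)
print(feas, opt)
```

Both residuals are zero up to rounding at this point. Small residuals show only that the KKT conditions hold approximately; as noted above, that alone does not guarantee that the point is a solution.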


Appendix D 


Further study 


In this appendix we list some further topics of study that are closely related to the 
material in this book, give a different perspective on the same material, complement 
it, or provide useful extensions. The topics are organized into groups, but the 
groups overlap, and there are many connections between them. 


Mathematics 


Probability and statistics. In this book we do not use probability and statistics, 
even though we cover multiple topics that are traditionally addressed using ideas 
from probability and statistics, including data fitting and classification, control, 
state estimation, and portfolio optimization. Further study of many of the topics in 
this book requires a background in basic probability and statistics, and we strongly 
encourage you to learn this material. (We also urge you to remember that topics 
like data fitting can be discussed without ideas from probability and statistics.) 


Abstract linear algebra. This book covers some of the most important basic ideas 
from linear algebra, such as linear independence. In a more abstract course you 
will learn about vector spaces, subspaces, nullspace, and range. Eigenvalues and 
singular values are useful topics that we do not cover in this book. Using these 
concepts you can analyze and solve linear equations and least squares problems 
when the basic assumption used in this book (i.e., the columns of some matrix are 
linearly independent) does not hold. Another more advanced topic that arises in 
the solution of linear differential equations is the matrix exponential. 


Mathematical optimization. This book focuses on just a few optimization prob- 
lems: Least squares, linearly constrained least squares, and their nonlinear ex- 
tensions. In an optimization course you will learn about more general optimiza- 
tion problems, for example ones that include inequality constraints. Convex op- 
timization is a particularly useful generalization of the linearly constrained least 
squares problem. Convex optimization problems can be solved efficiently and non- 
heuristically, and include a wide range of practically useful problems that arise in 
many application areas, including all of the ones we have seen in this book. We
would strongly encourage you to learn convex optimization, which is widely used in 
many applications. It is also useful to learn about methods for general non-convex 
optimization problems. 


Computer science 


Languages and packages for linear algebra. We hope that you will actually 
use the ideas and methods in this book in practical applications. This requires 
a good knowledge and understanding of at least one of the computer languages 
and packages that support linear algebra computations. In a first introduction 
you can use one of these packages to follow the material of this book, carrying 
out numerical calculations to verify our assertions and experiment with methods. 
Developing more fluency in one or more of these languages and packages will greatly 
increase your effectiveness in applying the ideas in this book. 


Computational linear algebra. In a course on computational or numerical linear 
algebra you will learn more about floating point numbers and how the small round- 
off errors made in numerical calculations affect the computed solutions. You will 
also learn about methods for sparse matrices, and iterative methods that can solve 
linear equations, or compute least squares solutions, for extremely large problems 
such as those arising in image processing or in the solution of partial differential 
equations. 


Applications 


Machine learning and artificial intelligence. This book covers some of the basic 
ideas of machine learning and artificial intelligence, including a first exposure to 
clustering, data fitting, classification, validation, and feature engineering. In a
further course on this material, you will learn about other unsupervised learning
methods (k-means is one example), such as principal components analysis, nonnegative
matrix factorization, and more sophisticated clustering methods. You will also learn
about more sophisticated regression and classification methods, such as logistic
regression and the support vector machine, as well as methods for computing model
parameters that scale to extremely large problems. Additional topics might include
feature engineering and deep neural networks.


Linear dynamical systems, control, and estimation. We cover only the basics of 
these topics; entire courses cover them in much more detail. In these courses you 
will learn about continuous-time linear dynamical systems (described by systems 
of differential equations) and the matrix exponential, more about linear quadratic 
control and state estimation, and applications in aerospace, navigation, and GPS. 
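
To make the role of the matrix exponential concrete: for a continuous-time system dx/dt = Ax, the state at time t is x(t) = exp(tA) x(0). The sketch below approximates exp(tA) by truncating its power series, assuming Python with only the standard library. The function names `matmul` and `expm_series` and the example matrix are hypothetical choices for illustration; truncating the series is a simple way to see the idea, not the method a numerical package would use.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(A, t, terms=30):
    """Approximate exp(tA) by the first `terms` terms of its power series."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds (tA)^k / k!, starting at k = 0
    for k in range(1, terms):
        term = matmul(term, A)
        term = [[t * v / k for v in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Example system: dx1/dt = x2, dx2/dt = -x1 (a harmonic oscillator).
A = [[0.0, 1.0], [-1.0, 0.0]]
t = 1.0
x0 = [1.0, 0.0]

E = expm_series(A, t)
x_t = [E[0][0] * x0[0] + E[0][1] * x0[1],
       E[1][0] * x0[0] + E[1][1] * x0[1]]
print(x_t)
```

For this particular A, exp(tA) is known in closed form: its rows are (cos t, sin t) and (-sin t, cos t), so the computed state can be checked against (cos t, -sin t).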


Finance and portfolio optimization. Our coverage of portfolio optimization is
basic. In a further course you would learn about statistical models of returns,
factor models, transaction costs, more sophisticated models of risk, and the use of
convex optimization to handle constraints, for example, a limit on leverage or the
requirement that the portfolio be long-only.
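
The two constraints named above are easy to state on a vector of portfolio weights. The sketch below assumes Python with only the standard library, assumes the weights sum to one, and takes leverage to be the sum of the absolute values of the weights (an assumption made for this illustration). The function names and example portfolios are hypothetical. It only checks the constraints; choosing weights that satisfy them is the optimization problem a further course would address.

```python
def leverage(w):
    """Sum of absolute weights (assumed definition of leverage)."""
    return sum(abs(wi) for wi in w)

def is_long_only(w):
    """True when no asset is held short, i.e., every weight is nonnegative."""
    return all(wi >= 0 for wi in w)

w_long = [0.5, 0.3, 0.2]    # no short positions
w_short = [0.9, 0.4, -0.3]  # short position in the third asset

for w in (w_long, w_short):
    print(is_long_only(w), round(leverage(w), 6))
```

Under these assumptions a long-only portfolio has leverage one, and any short position pushes the leverage above one.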


Signal and image processing. Traditional signal processing, which is used throughout
engineering, focuses on convolution, the Fourier transform, and the so-called
frequency domain. More recent approaches use convex optimization, especially in
non-real-time applications, like image enhancement or medical image reconstruction.
(Even more recent approaches use neural networks.) You will find whole
courses on signal processing for a specific application area, like communications,
speech, audio, and radar; for image processing, there are whole courses on
microscopy, computational photography, tomography, and medical imaging.
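
As a reminder of the operation at the center of traditional signal processing, the sketch below computes the convolution of two short vectors directly from its definition, assuming Python with only the standard library. The function name `convolve` is a hypothetical helper written for this illustration; practical signal processing code would use a library routine, often one based on the Fourier transform.

```python
def convolve(a, b):
    """Convolution of vectors a and b, of length len(a) + len(b) - 1."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj  # a_i * b_j contributes to entry i + j
    return c

c = convolve([1, 0, -1], [2, 1, -1])
print(c)  # [2, 1, -3, -1, 1]
```

The result matches the coefficients of the product of the polynomials 1 - x^2 and 2 + x - x^2, which is one way to check a convolution by hand.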


Time series analysis. Time series analysis, and especially prediction, plays an
important role in many application areas, including finance and supply chain
optimization. It is typically taught in a statistics or operations research course, or as
a specialty course in a specific area such as econometrics.


k-means, 74
least norm, 351
least squares, 231
Levenberg–Marquardt, 386
modified Gram-Schmidt, 102
Newton, 388
penalty, 421
QR factorization, 190
solving linear equations, 208
aligned vectors, 58
α (alpha), 251
angle, 56
acute, 58
document, 58
obtuse, 58
orthogonal, 58
annualized return and risk, 359
anti-aligned vectors, 58
approximation
affine, 35
least squares, 226
Taylor, 35


functions, 246
orthonormal, 96
β (beta), 251
bi-criterion least squares, 311
bi-linear interpolation, 162
big-times-small-squared rule, 333, 442
bill of materials, 12
birth rate, 165, 219
bit, 22
block
matrix, 109, 179
vector, 4
Boeing 747, 379
Boole, George, 10
Boolean
classification, 285
features, 38, 281
least squares, 435
vector, 10, 26, 87
Bowie, David, 84
byte, 22, 122


Cauchy–Schwarz inequality, 56, 68
centroid, 74 
chain graph, 136, 317 
chain rule, 184, 444, 447 
channel equalization, 146 
Chebyshev inequality, 47, 54, 64, 305 
Chebyshev, Pafnuty, 47 
chemical 
equilibrium, 384 
reaction balance, 154 
circular difference matrix, 319 
circulation, 134 
classification, 285 
Boolean, 285 
handwritten digits, 290, 404 
iris flower, 289 
multi-class, 297 
classifier 
least squares, 288 
one-versus-others, 299 
closed-loop, 186 
cluster centroid, 74 
clustering, 69 
digits, 79 
objective, 72 
optimal, 73 
co-occurrence, 20 
coefficients 
linear equations, 152 
matrix, 107 
vector, 3 
colon notation, 4 
color vector, 6 
column-major, 159 
communication channel, 138 
compartmental system, 174 
completing the square, 242 
complexity, 22 
k-means algorithm, 79 
Gram-Schmidt algorithm, 102 
matrix-matrix multiply, 182 
matrix-vector multiplication, 123 
vector operations, 24 
compliance matrix, 150 
computer representation 
matrix, 122 
vector, 22 
confusion matrix, 287 
conservation of mass, 156 


constrained least squares, 339 
solution, 344 
sparse, 349 
constrained optimization, 448 
KKT conditions, 449 
contingency table, 111 
control, 314 
closed-loop, 186 
linear quadratic, 366 
nonlinear, 425 
state feedback, 185 
controllability matrix, 195 
convolution, 136 
correlation coefficient, 60, 251 
covariance matrix, 193 
cross product, 159 
cross-validation, 264 
efficient, 284 
currency exchange rate, 26, 125 
customer purchase 
matrix, 111 
vector, 10 
cycle, 145, 195 


data fitting, 245 
data matrix, 112, 116 
de-meaned vector, 52 
de-meaning, 149 
de-trended, 252 
de-tuning, 325 
death rate, 165, 219 
decision threshold, 294 
deformation, 150 
demand, 150 
elasticity matrix, 150 
shaping, 315 
dependent variable, 38 
dependent vectors, 89 
derivative, 35, 443 
chain rule, 184, 444, 447 
partial, 444 
diag, 114 
diagonal matrix, 114 
diet, 160 
difference matrix, 119, 317 
difference of vectors, 11 
difference vector, 26 
diffusion, 155 
digits, 79 
dilation, 129 
dimension 
matrix, 107 
vector, 3 
directed graph, 112, 132, 186 
Frobenius norm, 118
Frobenius, Ferdinand Georg, 118 
function 
affine, 32, 149 
argument, 29 
basis, 246 
composition, 183 
inner product, 30 
linear, 30, 147 
notation, 29 
objective, 226, 419 
rational, 160, 218, 282 
reversal, 148 
running sum, 149 
sigmoid, 390, 413 
sum, 159 


Galton, Sir Francis, 279
Game of Thrones, 84
Gauss, Carl Friedrich, 102, 161, 207, 225, 386
Gauss–Newton algorithm, 386
generalization, 260
generalized additive model, 271
gigabyte, 23
gigaflop, 23
global positioning system, see GPS
gone bust, 358
GPS, 373, 386
gradient, 228, 445
Gram matrix, 181, 214, 229, 250, 318, 332, 378
Gram, Jørgen Pedersen, 97 
Gram-Schmidt algorithm, 97, 190 
complexity, 102 
modified, 102 
graph, 112, 132, 186 
chain, 136 
circle, 145 
cycle, 145, 195 
social network, 116 
tree, 145 
grayscale, 9 
group representative, 72 


handwritten digits, 79, 290 
heat flow, 155 

hedging, 62 

Hestenes, Magnus, 422 
histogram vector, 9, 50 
homogeneous equations, 153 


house price regression, 39, 258, 265, 274 


identity matrix, 113 
illumination, 234 


image 
matrix, 110 
vector, 9 


impulse response, 138 
imputing missing entries, 86 
incidence matrix, 132, 171 


independence-dimension inequality, 91 


independent vectors, 89 
index 
column, 107 
range, 4 
row, 107 
vector, 3 
inequality 
Cauchy–Schwarz, 56, 68
Chebyshev, 47, 54 
independence-dimension, 91 
triangle, 46, 49, 57 
inner product, 19, 178 
function, 30 
matrices, 192 
input, 164 
input-output 
matrix, 157 
system, 140, 280, 314 
intercept, 38 


interpolation, 144, 154, 160, 162, 210, 218, 354
inverse 
left, 199 
matrix, 202 
Moore-Penrose, 215 
pseudo, 214, 337 
right, 201 
inversion, 316 
Tikhonov, 317 
invertible matrix, 202 
iris flower classification, 289, 301 


iterative method for least squares, 241 


Jacobi, Carl Gustav Jacob, 151 
Jacobian, 151, 446 


k-means 
algorithm, 74 
complexity, 79 
features, 273 
Kalman, Rudolph, 374 
Karush, William, 345 
Karush–Kuhn–Tucker, see KKT
Kirchhoff’s current law, 156 
Kirchhoff, Gustav, 156 
KKT 
conditions, 345, 449 
matrix, 345 
Kuhn, Harold, 345 
Kyoto prize, 374 


label, 38 
Lagrange 
multipliers, 344, 448 
polynomial, 211 
Lagrange, Joseph-Louis, 211 
Lambert function, 412 
Lambert, Johann Heinrich, 412 
Laplace, Pierre-Simon, 192 
Laplacian matrix, 192 


Laplacian regularization, 135, 317, 324 


least squares, 225 
bi-criterion, 311 
Boolean, 435 
classifier, 288 
data fitting, 245 
iterative method, 241 
multi-objective, 309 
nonlinear, 381 
recursive, 242 
residual, 225 
solution method, 231 
sparse, 232 

LeCun, Yann, 79 

left inverse, 199 

Legendre, Adrien-Marie, 225 
Leonardo of Pisa, 175
Leontief input-output model, 157, 174
Leontief, Wassily, 157
Levenberg, Kenneth, 391
Levenberg–Marquardt algorithm, 386
leverage, 358
Likert scale, 71, 270, 305
Likert, Rensis, 71
line, 18, 65, 365
segment, 18
linear
combination, 17
dynamical system, 163
equations, 147, 152
function, 30, 147
least squares problem, 226
quadratic control, 366
sparse equations, 210
versus affine, 33
linear dynamical system, 163
closed-loop, 186
state feedback, 185
linearity, 147
linearly independent
row vectors, 115
vectors, 89
link, 133
Lloyd, Stuart, 74
loan, 8, 93
location vector, 6
logarithmic spacing, 314
logistic regression, 288
long-only portfolio, 358
look-ahead, 266
loss function, 402
loss leader, 26
lower triangular matrix, 114


market
clearing, 14
return, 251
segmentation, 70
Markov model, 164, 175
Markov, Andrey, 164
Markowitz, Harry, 357
Marquardt, Donald, 391
mass, 169
matrix, 107
addition, 116
adjacency, 112, 133, 186
asset return, 110
block, 109, 179
cancellation, 217
circular difference, 319
coefficients, 107
compliance, 150
computer representation, 122
confusion, 287
controllability, 195
covariance, 193
data, 112, 116
demand elasticity, 150
diagonal, 114
difference, 119, 317
dimensions, 107
document-term, 116
dynamics, 163
elasticity, 336, 394
elements, 107
equality, 107
feature, 152
Gram, 181, 214, 229, 250, 318, 332, 378
graph, 112
identity, 113
image, 110
incidence, 132, 171
inner product, 192
inverse, 199, 202, 209
invertible, 202
Jacobian, 151, 446
KKT, 345
Laplacian, 192
least squares, 233
left inverse, 199
Leontief input-output, 157
lower triangular, 114
multiplication, 177
negative power, 205
nonsingular, 202
norm, 117
orthogonal, 189, 204
permutation, 132, 197
population dynamics, 219
power, 186
projection, 240
pseudo-inverse, 214, 229
relation, 112
resistance, 157
return, 110
reverser, 131, 148
rotation, 129, 191
running sum, 120
second difference, 183
singular, 202
sparse, 114
square, 108
squareroot, 186, 194
stacked, 109
state feedback gain, 185
subtraction, 116
sum, 116
symmetric, 116
tall, 108

Toeplitz, 138, 280, 316 
trace, 192 

transpose, 115 
triangular, 114, 206 
triple product, 182 
upper triangular, 114 


Vandermonde, 121, 127, 154, 210, 256 


vector multiplication, 118 
wide, 108 
zero, 113 
matrix-vector product, 147 
mean, 20, 21 
mean return, 54 
mechanical equilibrium, 384 
minimum mean square error, see MMSE 
missing entries, 86 
mixing audio, 18 
mixture of vectors, 17 
MMSE, 247 
MNIST, 79, 290, 404 
model 
nonlinear, 386, 399 
over-fit, 261 
parameter, 246 
stratified, 272, 336 
validation, 260 
modified Gram-Schmidt algorithm, 102 
monochrome image, 9 
Moore’s law, 280 
Moore, Eliakim, 215 
Moore, Gordon, 280 
Moore-Penrose inverse, 215 
motion, 169 
moving average, 138 
μ (mu), 20, 53
multi-class classification, 297 
multi-objective least squares, 309 
multiplication 
matrix-matrix, 177 
matrix-vector, 118 
scalar-matrix, 117 
scalar-vector, 15 
sparse matrix, 182 


Nash equilibrium, 385 

Nash, John Forbes Jr., 385 
navigation, 373 

nearest neighbor, 50, 63, 65, 66, 73, 306 
net present value, see NPV 

Netflix, 284 

network, 133 

neural network, 273, 413 

Newton algorithm, 388 

Newton’s law of motion, 42, 169, 343 
Newton, Isaac, 42, 386 

nnz (number of nonzeros), 6, 114 


Nobel prize 
Leontief, 158 
Markowitz, 357 
Nash, 385 
node, 112 
nonlinear 
control, 425 
equations, 381 
least squares, 381 
model fitting, 386, 399 
nonnegative vector, 27 
nonsingular matrix, 202 
norm, 45 
Euclidean, 45 
Frobenius, 118 
matrix, 117 
weighted, 68 
normal equations, 229 
notation 
function, 29 
overloading, 5 
NPV, 22, 94, 103 
number 
floating point, 22 
of nonzeros, 114 
nutrients, 160, 352 


objective 
clustering, 72 
function, 226, 419 
observations, 245 
obtuse angle, 58 
occurrence vector, 10 
offset, 38 
one-hot encoding, 270 
one-versus-others classifier, 299 
ones vector, 5 
open-loop, 368 
optimal clustering, 73 
optimal trade-off curve, 311 
optimality condition 
least squares, 229 
nonlinear least squares, 382 
optimization, 447 
constrained, 448 
order, 24 
orthogonal 
distance regression, 400 
matrix, 189, 204 
vectors, 58 
orthogonality principle, 231 
orthonormal 
basis, 96 
expansion, 96 
row vectors, 115 
vectors, 95 
out-of-sample validation, 261 
outcome, 245

outer product, 178 
over-determined, 153, 382 
over-fit, 261 

overloading, 5 


parallelogram law, 64 
parameter 
model, 246 
regularization, 328 
Pareto optimal, 311, 360 
Pareto, Vilfredo, 311 
partial derivative, 35, 444 
path, 133, 186 
penalty algorithm, 421 
Penrose, Roger, 215 
permutation matrix, 132, 197 
pharmaco-kinetics, 174 
phugoid mode, 379 
piecewise-linear fit, 256 
pixel, 9 
polynomial 
evaluation, 21, 120 
fit, 255 
interpolation, 154, 160, 210 
Lagrange, 211 
population dynamics, 164, 188 
portfolio 
gone bust, 358 
leverage, 358 
long-only, 358 
optimization, 357 
return, 22, 120, 358 
risk, 359 
sector exposure, 161 
trading, 14 
value, 22 
vector, 7 
weights, 357 
potential, 135, 156 
Powell, Michael, 422 
power of matrix, 186 
precision, 287 
prediction error, 50, 152, 246 
price 
elasticity, 150, 336 
equilibrium, 384 
vector, 21 
probability, 21 
product 
block matrix, 179 
cross, 159 
dot, 19 
inner, 19, 178 
matrix-matrix, 177 
matrix-vector, 147 
outer, 178 


projection, 65, 129, 144, 240 
proportions, 7 

pseudo-inverse, 214, 229, 337 
push-through identity, 218, 333 
Pythagoras of Samos, 60 


QR factorization, 189, 206, 231, 348, 351 


quadrature, 161, 220 


random features, 273, 293, 406, 409 


Raphson, Joseph, 388 
rational function, 160, 218, 282 
recall rate, 287 


receiver operating characteristic, see ROC 


recommendation engine, 85 
recursive least squares, 242 
regression, 151, 257 

house price, 39, 258 

logistic, 288 

model, 38 

to the mean, 279 
regressors, 38 
regularization, 364 

parameter, 328 

path, 328, 332 

terms, 314 
relation, 112 

friend, 116 
residual, 225, 381, 419 
residual sum of squares, see RSS 
resistance matrix, 157 
return, 8, 54 

annualized, 359 

matrix, 110 

vector, 22 
reversal function, 148 
reverser matrix, 131, 148 
RGB, 6 
Richardson, Lewis, 241 
ridge regression, 325 
right inverse, 201 
right-hand side, 152 
risk, 54, 359 
risk-free asset, 358 
RMS, 46 

deviation, 48 

prediction error, 50 
rms (root-mean-square), 46 
ROC, 294 
root-mean-square, see RMS 
rotation, 129, 191 
round-off error, 23, 102 
row vector, 108 

linearly independent, 115 
running sum, 120, 149 


samples, 245 
sampling interval, 170
scalar, 3 
scalar-matrix multiplication, 117 
scalar-vector multiplication, 15 
scaling, 129 
Schmidt, Erhard, 97 
Schwarz, Hermann, 57 
score, 21 
seasonal component, 255 
seasonally adjusted time series, 255 
second difference matrix, 183 
sector exposure, 27, 161, 352 
segment, 18 
sensitivity, 287 
shaping demand, 315 
short position, 7, 22 
shrinkage, 325 
σ (sigma), 53
sigmoid function, 390, 413 
sign function, 289 
signal, 7 
flow graph, 413 
Simpson’s rule, 161 
Simpson, Thomas, 161 
singular matrix, 202 
sink, 134 
skewed classifier, 294 
slice, 4, 131 
social network graph, 116 
source, 134 
sparse 
constrained least squares, 349 
least squares, 232 
linear equations, 210, 350 
matrix, 114 
matrix multiplication, 182 
QR factorization, 190 
vector, 6, 24 
specificity, 287 
spherical distance, 58 
spline, 341 
square 
matrix, 108 
system of equations, 153, 382 
squareroot of matrix, 194 


stacked 
matrix, 109 
vector, 4 


standard deviation, 52, 248 
standardization, 56 

standardized features, 269 

state, 163 

state feedback control, 185, 335, 371 
std (standard deviation), 52 
steganography, 354 

Steinhaus, Hugo, 74 


stemming, 10, 82 
stoichiometry, 162 
stop words, 10 
straight-line fit, 249 
stratified model, 272, 336 
subadditivity, 46 
submatrix, 109 
subset vector, 10 
subtraction 

matrix, 116 

vector, 11 
subvector, 4 
sum 

linear function, 159 

matrix, 116 

of squares, 20, 45, 247 

vector, 11 
superposition, 30, 147 
supply chain dynamics, 171 
support vector machine, 288 
survey response, 71 
symmetric matrix, 116 


tall matrix, 108 
Taylor approximation, 35, 64, 151, 185, 387, 443
Taylor, Brook, 36
term frequency inverse document frequency, see TFIDF

test data set, 261 

TFIDF, 273 

thermal resistance, 157 

Tikhonov, Andrey, 317 

time series 
auto-regressive model, 259 
de-trended, 252 
prediction validation, 266 
seasonally-adjusted, 255 
smoothing, 138 
vector, 7 

time-invariant, 163 

Toeplitz matrix, 138, 280, 316 

Toeplitz, Otto, 138 

topic discovery, 70, 82 

trace, 192 

tracking, 368 

trade list, 14 

trade-off curve, 311 

training data set, 261 

trajectory, 163 

transpose, 115 

tree, 145 

trend line, 252 

triangle inequality, 46, 49, 57, 118 

triangular matrix, 114, 206 

trim conditions, 379 

true negative rate, 287 
true positive rate, 287
Tucker, Albert, 345 


uncorrelated, 60 
under-determined, 153, 382 
unit vector, 5 

units for vector entries, 51, 63 
up-conversion, 144 

upper triangular matrix, 114 


validation, 260, 314 
classification, 288 
limitations, 268 
set, 261 
time series prediction, 266 

Vandermonde matrix, 121, 127, 154, 210, 256

Vandermonde, Alexandre-Théophile, 121 

variable, 225 

vector, 3 
addition, 11 
affine combination, 17 
aligned, 58 
angle, 56 
anti-aligned, 58 
AR model, 164, 283 
basis, 91 
block, 4 
Boolean, 10, 26, 87 
cash flow, 8, 93 
clustering, 69 
coefficients, 3 
color, 6 
components, 3 
computer representation, 22 
correlation coefficient, 60 
customer purchase, 10 
de-meaned, 52 
dependent, 89 
difference, 26 
dimension, 3 
distance, 48 
entries, 3 
equality, 3 
feature, 10, 21, 245 
histogram, 9 
image, 9 
independence, 89 
inner product, 19 
large, 45 
linear combination, 17 
linear dependence, 89 
linear independence, 89 
location, 6 
matrix multiplication, 118 
missing entry, 85 
mixture, 17 


nonnegative, 27 
occurrence, 10 
ones, 5 
orthogonal, 58 
orthonormal, 95 
outer product, 178 
portfolio, 7 
price, 21 
probability, 21 
proportions, 7 
quantities, 7 
return, 22 
RMS deviation, 48 
RMS value, 46 
row, 108 
slice, 4 
small, 45 
sparse, 6, 24 
stacked, 4 
standardization, 56 
subset, 10 
sum, 11 
time series, 7 
unit, 5 
units for entries, 51, 63 
weight, 21, 38 
word count, 9, 87 
zero, 5 

vertex, 112 

video, 9 


warm start, 393 
way-point constraint, 371 
weather zones, 71 
weight vector, 38 
weighted 

average, 17, 334 

Gram matrix, 334 

norm, 68 

sum, 30 

sum of squares, 310 
wide matrix, 108 
Wikipedia, 51, 82 
Wilkinson, James H., 114 
Winsor, Charles P., 270 
winsorized feature, 269 
word count 

TFIDF, 273 

vector, 9, 50, 87 


z-score, 56, 67, 269 
zero
matrix, 113 
vector, 5 
ZIP code, 71, 274 


